Introduced in Feature Pack 2

Troubleshooting: Preventing external URLs from being requested

You can prevent external URLs from being requested for the unstructured search index for performance benefits. This is done by excluding the attachmentURL entity in requests.

Problem

Indexing performance suffers if your site requests external URLs for the unstructured search index.

Solution

To exclude external URLs:
  1. Update the wc-dataimport-preprocess-unstructured-content.xml file, noting the following changes in bold:
    
    <_config:data-processing-config processor="com.ibm.commerce.foundation.dataimport.preprocess.StaticAttributeDataPreProcessor" masterCatalogId="10001" batchSize="500">
        <_config:table definition="CREATE TABLE  TI_CEATCHTU_0_#lang_tag# (ATCHREL_ID VARCHAR(128) NOT NULL, ATCHAST_ID BIGINT NOT NULL, CATENTRY_ID BIGINT NOT NULL, IDENTIFIER VARCHAR(128) NOT NULL, NAME VARCHAR(128), SHORTDESCRIPTION  VARCHAR(254), LONGDESCRIPTION VARCHAR(4000), ATCHASTPATH VARCHAR(520), DIRECTORY VARCHAR(254), DIRECTORYPATH VARCHAR(128), MIMETYPE VARCHAR(254), LANGUAGE_ID INTEGER, IMAGE VARCHAR(254), RULENAME VARCHAR(254), PRIMARY KEY (ATCHREL_ID))" name=" TI_CEATCHTU_0_#lang_tag#"/>
        <_config:query sql="SELECT rtrim(cast(ATCHREL.atchrel_id as char(23))) ||'_'|| rtrim(cast(ATCHAST.atchast_id as char(23))) as atchrel_id, CE.CATENTRY_ID, ATCHAST.atchast_id,  ATCHTGT.identifier, ATCHRELDSC.name, ATCHRELDSC.shortdescription, ATCHRELDSC.longdescription,
        rtrim(ATCHAST.atchastpath) atchastpath, STORE.directory, ATCHAST.directorypath, ATCHAST.mimetype, ATCHASTLG.language_id, ATCHRLUS.Image, ATCHRLUS.identifier rulename
    
        FROM TI_CATENTRY_0 CE
        JOIN ATCHREL ON ATCHREL.BIGINTOBJECT_ID = CE.CATENTRY_ID
        JOIN ATCHOBJTYP ON (ATCHREL.ATCHOBJTYP_ID = ATCHOBJTYP.ATCHOBJTYP_ID AND ATCHOBJTYP.IDENTIFIER = 'CATENTRY')
        LEFT JOIN ATCHTGT on (ATCHREL.atchtgt_id = ATCHTGT.atchtgt_id )
        JOIN ATCHAST on (ATCHAST.atchtgt_id = ATCHTGT.atchtgt_id AND ATCHAST.MIMETYPEENCODING = 'URL')
        LEFT JOIN ATCHASTLG on (ATCHASTLG.atchast_id = ATCHAST.atchast_id) 
        LEFT JOIN ATCHRELDSC on (ATCHRELDSC.atchrel_id = ATCHREL.atchrel_id AND ATCHRELDSC.language_id=?language_id?)
        JOIN ATCHRLUS ON (ATCHREL.ATCHRLUS_ID = ATCHRLUS.ATCHRLUS_ID)
        LEFT JOIN STORE on (ATCHAST.storeent_id = STORE.store_id)
        WHERE  (ATCHASTLG.atchastlg_id is null or ATCHASTLG.language_id=?language_id?) order by ATCHREL.sequence, ATCHREL.atchrel_id"/>
        <_config:mapping>
          <_config:key queryColumn="ATCHREL_ID" tableColumn="ATCHREL_ID"/>
          <_config:column-mapping>
           <_config:column-column-mapping>
             <_config:column-column queryColumn="ATCHAST_ID" tableColumn="ATCHAST_ID" />
             <_config:column-column queryColumn="CATENTRY_ID" tableColumn="CATENTRY_ID" />
             <_config:column-column queryColumn="IDENTIFIER" tableColumn="IDENTIFIER" />
             <_config:column-column queryColumn="NAME" tableColumn="NAME" />
             <_config:column-column queryColumn="SHORTDESCRIPTION" tableColumn="SHORTDESCRIPTION" />
             <_config:column-column queryColumn="LONGDESCRIPTION" tableColumn="LONGDESCRIPTION" />
             <_config:column-column queryColumn="ATCHASTPATH" tableColumn="ATCHASTPATH" />
             <_config:column-column queryColumn="DIRECTORY" tableColumn="DIRECTORY" />
             <_config:column-column queryColumn="DIRECTORYPATH" tableColumn="DIRECTORYPATH" />
             <_config:column-column queryColumn="MIMETYPE" tableColumn="MIMETYPE" />
             <_config:column-column queryColumn="LANGUAGE_ID" tableColumn="LANGUAGE_ID" />
             <_config:column-column queryColumn="IMAGE" tableColumn="IMAGE" />
             <_config:column-column queryColumn="RULENAME" tableColumn="RULENAME" />
            </_config:column-column-mapping>
          </_config:column-mapping>
        </_config:mapping>
      </_config:data-processing-config>
    
  2. Update the TI_CEATCHT_0 table, noting the following change in bold:
    
    JOIN ATCHAST on (ATCHAST.atchtgt_id = ATCHTGT.atchtgt_id AND ATCHAST.mimetype &lt&gt; '' AND ATCHAST.MIMETYPEENCODING &lt;&gt; 'URL')
  3. Update the wc-data-config.xml file, noting the following changes in bold:
    
    <entity name="AttachmentURL-test"
     dataSource="WC database"
     query="SELECT I_CATENTATCHT_1.CATENTRY_ID, I_CATENTATCHT_1.ATCHAST_ID, I_CATENTATCHT_1.ATCHREL_ID, I_CATENTATCHT_1.IDENTIFIER, I_CATENTATCHT_1.ATCHASTPATH, I_CATENTATCHT_1.MIMETYPE, I_CATENTATCHT_1.NAME, I_CATENTATCHT_1.IMAGE, I_CATENTATCHT_1.RULENAME, I_CATENTATCHT_1.SHORTDESCRIPTION, I_CATENTATCHT_1.LONGDESCRIPTION, I_CATENTATCHT_1.DIRECTORY
         FROM  TI_CEATCHTU_0_1 I_CATENTATCHT_1"
     deltaImportQuery="SELECT I_CATENTATCHT_1.CATENTRY_ID, I_CATENTATCHT_1.ATCHAST_ID, I_CATENTATCHT_1.ATCHREL_ID, I_CATENTATCHT_1.IDENTIFIER, I_CATENTATCHT_1.ATCHASTPATH, I_CATENTATCHT_1.MIMETYPE, I_CATENTATCHT_1.NAME, I_CATENTATCHT_1.IMAGE, I_CATENTATCHT_1.RULENAME, I_CATENTATCHT_1.SHORTDESCRIPTION, I_CATENTATCHT_1.LONGDESCRIPTION, I_CATENTATCHT_1.DIRECTORY
         FROM  TI_CEATCHTU_0_1 I_CATENTATCHT_1"
     deltaQuery="SELECT ATCHREL_ID FROM  TI_CEATCHTU_0_1 FETCH FIRST 1 ROWS ONLY"
     transformer="script:isWriteToFile"
    >
     <field column="ATCHREL_ID" name="attachmentrel_id" />
     <field column="ATCHAST_ID" name="attachment_id" />
     <field column="CATENTRY_ID" name="catentry_id" />
     <field column="IDENTIFIER" name="identifier" />
     <field column="NAME" name="name" />
     <field column="IMAGE" name="image" />
     <field column="RULENAME" name="rulename" />
     <field column="SHORTDESCRIPTION" name="shortdesc" />
     <field column="LONGDESCRIPTION" name="longdesc" />
     <field column="ATCHASTPATH" name="path" />
     <field column="MIMETYPE" name="mimetype" />
    </entity>
    
To filter based on MIME type, for example images:

   SELECT I_CATENTATCHT_1.CATENTRY_ID, I_CATENTATCHT_1.ATCHAST_ID,
I_CATENTATCHT_1.ATCHREL_ID, I_CATENTATCHT_1.IDENTIFIER,
I_CATENTATCHT_1.ATCHASTPATH, I_CATENTATCHT_1.MIMETYPE,
I_CATENTATCHT_1.NAME, I_CATENTATCHT_1.IMAGE, I_CATENTATCHT_1.RULENAME,
I_CATENTATCHT_1.SHORTDESCRIPTION, I_CATENTATCHT_1.LONGDESCRIPTION,
I_CATENTATCHT_1.DIRECTORY
    FROM TI_CEATCHT_0_1 I_CATENTATCHT_1 WHERE MIMETYPE NOT LIKE
'image/%' with ur;

The new Solr document reflects the updated exclusions and increases the performance of the preprocess and build index scripts.