Unstructured content indexing and handling

Unstructured content can reside in several locations, including the WebSphere Commerce database, the file systems of servers, and the Internet. The indexing process for unstructured content therefore uses a hybrid of data sources to create indexing information by using the existing WebSphere Commerce Search indexing framework.

Important: By default, WebSphere Commerce Search indexes only unencrypted unstructured data. Processing encrypted data with WebSphere Commerce Search is not supported.

Unstructured content organization and retrieval

Unstructured content includes catalog entry information and the associated attachments. WebSphere Commerce attachments are used to find the attachments that are related to a catalog entry for a specific language. Because not all attachments have a language_id, the default behavior merges the results that have a null language_id with the results for the specific language_id. For example, the following SQL statement performs this search:


SELECT ATCHREL.atchrel_id, CE.CATENTRY_ID, ATCHAST.atchast_id, ATCHTGT.identifier, ATCHTGTDSC.name, ATCHTGTDSC.shortdescription, ATCHTGTDSC.longdescription,
ATCHAST.atchastpath , STORE.directory, ATCHAST.directorypath, ATCHAST.mimetype, ATCHASTLG.language_id, ATCHRLUS.Image, ATCHRLUS.identifier rulename
FROM TI_CATENTRY_0 CE
JOIN ATCHREL ON ATCHREL.BIGINTOBJECT_ID = CE.CATENTRY_ID
JOIN ATCHOBJTYP ON (ATCHREL.ATCHOBJTYP_ID = ATCHOBJTYP.ATCHOBJTYP_ID AND ATCHOBJTYP.IDENTIFIER = 'CATENTRY')
LEFT JOIN ATCHTGT on (ATCHREL.atchtgt_id = ATCHTGT.atchtgt_id )
LEFT JOIN ATCHAST on (ATCHAST.atchtgt_id = ATCHTGT.atchtgt_id)
LEFT JOIN ATCHASTLG on (ATCHASTLG.atchast_id = ATCHAST.atchast_id)
LEFT JOIN ATCHTGTDSC on (ATCHTGTDSC.atchtgt_id = ATCHTGT.atchtgt_id AND ATCHTGTDSC.language_id=?language_id?)
JOIN ATCHRLUS ON (ATCHREL.ATCHRLUS_ID = ATCHRLUS.ATCHRLUS_ID)
LEFT JOIN STORE on (ATCHAST.storeent_id = STORE.store_id)
WHERE (ATCHASTLG.atchastlg_id is null or ATCHASTLG.language_id=?language_id?) order by ATCHREL.atchrel_id

Where language_id is defined in the WebSphere Commerce database (for example, -1 represents United States English), and the list of catentry_id values is the input parameter. The result of the search contains the catentry_id, attachment path, attachment usage, and attachment description. The default attachment usage types are DOCUMENTS, USERMANUAL, WARRANTY, and OTHER. These usage types are configurable to meet your specific search requirements.
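The merge of language-neutral and language-specific results that the WHERE clause performs can also be sketched outside the database. The following JavaScript is only an illustration, not product code: the function name selectAttachmentsForLanguage, the row properties atchrelId and languageId, and the sample values are assumptions made for this example.

```javascript
// Hypothetical row objects; the property names are illustrative only.
function selectAttachmentsForLanguage(rows, languageId) {
  return rows
    // Keep language-neutral rows (null language) and rows in the requested language.
    .filter(function (row) {
      return row.languageId === null || row.languageId === languageId;
    })
    // Mirror the ORDER BY ATCHREL.atchrel_id clause.
    .sort(function (a, b) {
      return a.atchrelId - b.atchrelId;
    });
}

// Arbitrary sample data: one row in another language, one language-neutral row,
// and one row in the requested language (-1).
var sampleRows = [
  { atchrelId: 3, languageId: -2 },
  { atchrelId: 2, languageId: null },
  { atchrelId: 1, languageId: -1 }
];

var englishRows = selectAttachmentsForLanguage(sampleRows, -1);
console.log(englishRows.map(function (row) { return row.atchrelId; }));
```

The row for the other language is dropped, while the language-neutral row is returned together with the row for the requested language.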

The attachment types that are accepted by the unstructured index are configurable in the wc-data-config.xml file of the unstructured core. They are defined in the script snippet of the sample code. For example:


<script><![CDATA[
function isWriteToFile(row) {
    var ruleName = row.get('RULENAME');
    var writeToFile = "false";

    if (ruleName != null) {
        if (ruleName == 'DOCUMENTS' || ruleName == 'USERMANUAL' || ruleName == 'WARRANTY'
                || ruleName == 'OTHER') {
            writeToFile = "true";
        }
    }

    row.put('writeToFile', writeToFile);
    return row;
}
]]></script>

Where the rule names that are compared in the condition are the accepted attachment usage types.
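One possible way to make this check easier to customize is to keep the accepted rule names in a single list. The following is a minimal sketch, not the shipped script: the usage type DATASHEET is a hypothetical custom value, and the makeRow helper is an assumed stand-in for the map-like row object (with get and put) so that the function can be run outside the DIH.

```javascript
// Usage types whose attachments are written to file. The first four are the
// default usage types; 'DATASHEET' is a hypothetical custom usage type.
var acceptedRuleNames = ['DOCUMENTS', 'USERMANUAL', 'WARRANTY', 'OTHER', 'DATASHEET'];

function isWriteToFile(row) {
  var ruleName = row.get('RULENAME');
  // A null rule name never matches; String(...) normalizes the value before the lookup.
  var accepted = ruleName != null && acceptedRuleNames.indexOf(String(ruleName)) !== -1;
  // The flag is stored as the string "true" or "false", as in the sample script.
  row.put('writeToFile', accepted ? 'true' : 'false');
  return row;
}

// Stand-in for the map-like row object so that the sketch runs on its own.
function makeRow(values) {
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(values, key) ? values[key] : null;
    },
    put: function (key, value) {
      values[key] = value;
    }
  };
}

console.log(isWriteToFile(makeRow({ RULENAME: 'DATASHEET' })).get('writeToFile'));
console.log(isWriteToFile(makeRow({ RULENAME: 'UNLISTED' })).get('writeToFile'));
```

With this layout, adding or removing a usage type only requires a change to the list, not to the condition.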

Content configuration for the preprocess utility

The preprocess utility extracts and flattens WebSphere Commerce data and then outputs the data into a set of temporary tables inside the WebSphere Commerce database. The data in the temporary tables is then used by the index building utility to populate the data into search indexes using the Data Import Handler (DIH).

The preprocess utility first picks the wc-dataimport-preprocess-fullbuild.xml file or the wc-dataimport-preprocess-deltaupdate.xml file, and then transforms the results of the SQL statements that are defined in that file into temporary tables. Next, the utility handles each remaining configuration XML file in alphabetical order.

Unstructured content preprocessing is language-specific and uses an unstructured content configuration file. The relationship between attachments and catalog entry information is stored in the generated table, where the DIH later retrieves it.

See di-preprocess utility for more information.

Data Import Handler and indexing unstructured content

The Data Import Handler indexes unstructured content from a hybrid of data sources by using the existing WebSphere Commerce Search indexing framework. The TikaEntityProcessor is used to support the hybrid data.

The following diagram illustrates the role of the TikaEntityProcessor in handling unstructured content in WebSphere Commerce:

TikaEntityProcessor diagram

Where:

1, 2: Catalog entries are indexed from the WebSphere Commerce database as structured content.
3, 4, 5: The logic reuses the features of the DIH framework such as looping through the SQL result set rows and passing parameters in a form resembling the following: ${Attachment.CATENTRY_ID}. The TikaEntityProcessor uses the sourceUrl parameter with commercebase parameters to fetch content from the Internet, parses the binary content, and returns the results to the unstructured content index. Next, the TikaEntityProcessor appends the text content to a catentryId.txt file, located in the temp folder under the unstructured core root folder.
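The parameter form ${Attachment.CATENTRY_ID} can be read as "the CATENTRY_ID column of the current row of the Attachment entity". The following JavaScript sketch imitates that substitution for illustration only. It is not the DIH implementation, and the function name resolvePlaceholders, the sample row, and the sample path are assumptions.

```javascript
// Resolve DIH-style placeholders such as ${Attachment.CATENTRY_ID} against
// the current row of each named entity. Unknown placeholders are left as-is.
function resolvePlaceholders(template, context) {
  return template.replace(/\$\{(\w+)\.(\w+)\}/g, function (match, entity, column) {
    var row = context[entity];
    if (row && Object.prototype.hasOwnProperty.call(row, column)) {
      return String(row[column]);
    }
    return match;
  });
}

// Hypothetical row and path; only the placeholder syntax comes from the text above.
var context = { Attachment: { CATENTRY_ID: 10001 } };
var tempFile = resolvePlaceholders('temp/${Attachment.CATENTRY_ID}.txt', context);
console.log(tempFile);
```

In this sketch, the resolved value is used to build the name of a per-catalog-entry text file, on the assumption that the catentryId.txt file is named after the catalog entry ID.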

The commercebase parameters can be customized to meet your specific search requirements. The tikacontentfield and tikaprefix parameters are directly mapped to the fmap.content and uprefix Solr Cell parameters. For more information, see ExtractingRequestHandler.

The di-buildindex utility is used to index the unstructured content.

Structured content data import handler process updates to index unstructured content

Structured objects might need to be searchable by the content of their related unstructured content. Therefore, the default structured content DIH indexing process must also read the unstructured content. The PlainTextEntityProcessor reads the temporary files that the TikaEntityProcessor creates during the unstructured content DIH process, and indexes their content in the defined field.

The basePath data source in the DIH configuration file defines the temporary folder location. In this configuration, onError="continue" is set so that if the file does not exist, or if other errors occur, the DIH process ignores the error and continues running. The column name of the unstructured field is fixed as plainText and must not be changed.
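The effect of ignoring read errors can be sketched as follows. This JavaScript is an illustration only, not the PlainTextEntityProcessor: the function names, the in-memory file table, and the sample paths are assumptions, and the sketch assumes that a catalog entry whose file cannot be read is kept without the plainText field.

```javascript
// Build one document per catalog entry and try to attach the text of its
// temporary file in a field named plainText. A failed read is ignored so
// that processing continues with the remaining entries.
function readPlainTextFields(catentryIds, basePath, readFile) {
  var documents = [];
  catentryIds.forEach(function (id) {
    var doc = { catentryId: id };
    try {
      doc.plainText = readFile(basePath + '/' + id + '.txt');
    } catch (error) {
      // Missing file or other read error: leave plainText unset and continue.
    }
    documents.push(doc);
  });
  return documents;
}

// In-memory stand-in for the file system; paths and contents are hypothetical.
var files = { 'temp/10001.txt': 'User manual text' };
function readFromMemory(path) {
  if (!Object.prototype.hasOwnProperty.call(files, path)) {
    throw new Error('File not found: ' + path);
  }
  return files[path];
}

var docs = readPlainTextFields([10001, 10002], 'temp', readFromMemory);
console.log(docs.length);
```

Both catalog entries are returned, but only the entry that has a temporary file carries a plainText value.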