Unstructured content indexing and handling

The information that is stored in unstructured content can come from several locations, including the HCL Commerce database, server file systems, and the internet. Therefore, the indexing process for unstructured content uses a hybrid of data sources to create indexing information within the existing HCL Commerce Search indexing framework.

Important: By default, HCL Commerce Search indexes unstructured data only in decrypted (unencrypted) form. Processing encrypted data with HCL Commerce Search is not supported.

Unstructured content organization and retrieval

Unstructured content includes catalog entry information and its associated attachments. HCL Commerce attachments are used to find the catalog entry-related attachment for a specific language. Since not all attachments contain a language_id, the default behavior merges the null language_id results with the specific language_id results.

The language_id is defined in the HCL Commerce table, where -1 represents United States English, and the catentry_id list is the input parameter of the lookup. The result of the search contains the catentry_id, attachment path, attachment usage, and attachment description. The default attachment usage type list is DOCUMENTS, USERMANUAL, WARRANTY, and OTHER. These usage types are configurable to meet your specific search requirements.
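The merge of null language_id results with language-specific results can be illustrated with a minimal sketch. It is not the product's actual query. The row shape, the function name attachmentsForLanguage, and the sample data are assumptions made for illustration only.

```javascript
// Hypothetical row shape: { catentryId, languageId, path, usage, description }.
// languageId is null when the attachment is not tied to a language.
function attachmentsForLanguage(rows, catentryIds, languageId) {
  const wanted = new Set(catentryIds);
  return rows.filter(
    (row) =>
      wanted.has(row.catentryId) &&
      // Keep rows for the requested language and rows with no language at all.
      (row.languageId === languageId || row.languageId === null)
  );
}

// Hypothetical sample data; -1 is United States English, -2 stands for some other language.
const sampleRows = [
  { catentryId: 10001, languageId: -1, path: "docs/manual_en.pdf", usage: "USERMANUAL", description: "User manual" },
  { catentryId: 10001, languageId: null, path: "docs/warranty.pdf", usage: "WARRANTY", description: "Warranty" },
  { catentryId: 10001, languageId: -2, path: "docs/manual_other.pdf", usage: "USERMANUAL", description: "User manual, other language" },
  { catentryId: 10002, languageId: -1, path: "docs/other.pdf", usage: "OTHER", description: "Other" },
];

const merged = attachmentsForLanguage(sampleRows, [10001], -1);
console.log(merged.map((row) => row.path));
```

In this sketch, a request for catalog entry 10001 in language -1 returns the English manual together with the language-neutral warranty, and skips the attachment that belongs to another language.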

Attachment types that are accepted by the unstructured index are configurable in the wc-data-config.xml file of the unstructured core. They are listed in the script snippet of the sample code. For example:


<script><![CDATA[
// Flags a row for writing to file when its attachment usage type is accepted.
function isWriteToFile(row) {
    var ruleName = row.get('RULENAME');
    var writeToFile = "false";

    if (ruleName != null) {
        // Accepted attachment usage types; edit this condition to customize.
        if (ruleName == 'DOCUMENTS' || ruleName == 'USERMANUAL' || ruleName == 'WARRANTY'
                || ruleName == 'OTHER') {
            writeToFile = "true";
        }
    }

    row.put('writeToFile', writeToFile);
    return row;
}
]]></script>

The accepted attachment usage types are the rule names that are compared against RULENAME in the condition check.
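One possible way to make the accepted usage types easier to customize is to keep them in a single list. The following is a sketch only, not the shipped sample code. The ACCEPTED_USAGES constant, the makeRow helper (a stand-in for the row map that the DIH passes to the script), and the IMAGE usage type are assumptions for illustration.

```javascript
// Accepted usage types kept in one place; edit this list to customize.
const ACCEPTED_USAGES = ["DOCUMENTS", "USERMANUAL", "WARRANTY", "OTHER"];

// Same contract as the sample function: sets writeToFile to "true" or "false".
function isWriteToFile(row) {
  const ruleName = row.get("RULENAME");
  const accepted =
    ruleName != null && ACCEPTED_USAGES.indexOf(String(ruleName)) !== -1;
  row.put("writeToFile", accepted ? "true" : "false");
  return row;
}

// Hypothetical stand-in for the DIH row map; only get and put are modeled.
function makeRow(values) {
  const data = Object.assign({}, values);
  return {
    get: (key) => (key in data ? data[key] : null),
    put: (key, value) => {
      data[key] = value;
    },
  };
}

const manualRow = isWriteToFile(makeRow({ RULENAME: "USERMANUAL" }));
const imageRow = isWriteToFile(makeRow({ RULENAME: "IMAGE" })); // hypothetical, not accepted
const emptyRow = isWriteToFile(makeRow({}));
console.log(manualRow.get("writeToFile"), imageRow.get("writeToFile"), emptyRow.get("writeToFile"));
```

With this layout, adding or removing a usage type is a one-line change to the list, and rows without a RULENAME value are still flagged as "false".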

Data Import Handler and indexing unstructured content

The Data Import Handler (DIH) manages the indexing of unstructured content. It uses a hybrid of data sources to create indexing information within the existing HCL Commerce Search indexing framework. The TikaEntityProcessor is used to support the hybrid data.

The following diagram illustrates the role of the TikaEntityProcessor in handling unstructured content in HCL Commerce:

TikaEntityProcessor diagram

Where:

1, 2: Catalog entries are indexed from the HCL Commerce database as structured content.
3, 4, 5: The logic reuses the features of the DIH framework such as looping through the SQL result set rows and passing parameters in a format similar to ${Attachment.CATENTRY_ID}. The TikaEntityProcessor uses the sourceUrl parameter with commercebase parameters to fetch content from the internet, parses the binary content, and returns the results to the unstructured content index. Next, the TikaEntityProcessor appends the text content to a catentryId.txt file, which is located in the temp folder under the unstructured core root folder.

The commercebase parameters can be customized to meet your specific search requirements. The tikacontentfield and tikaprefix parameters are directly mapped to the fmap.content and uprefix Solr Cell parameters. For more information, see ExtractingRequestHandler.
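The parameter passing and the per-catalog-entry temporary file described in steps 3, 4, and 5 can be sketched as follows. This is an illustration under stated assumptions, not the DIH or TikaEntityProcessor implementation. The resolvePlaceholders function, the EXTRACTED_TEXT column (a stand-in for text that Tika would extract), and the in-memory map that stands in for the temp folder are all hypothetical.

```javascript
// Replace ${Entity.COLUMN} placeholders with values from the current row of each entity.
function resolvePlaceholders(template, context) {
  return template.replace(/\$\{(\w+)\.(\w+)\}/g, (match, entity, column) => {
    const row = context[entity];
    // In this sketch, unknown placeholders are left untouched.
    return row && column in row ? String(row[column]) : match;
  });
}

// Hypothetical SQL result set rows for an entity named Attachment.
const attachmentRows = [
  { CATENTRY_ID: 10001, EXTRACTED_TEXT: "User manual text" },
  { CATENTRY_ID: 10001, EXTRACTED_TEXT: "Warranty text" },
  { CATENTRY_ID: 10002, EXTRACTED_TEXT: "Other document text" },
];

// In-memory stand-in for the temp folder: file name -> accumulated text.
const tempFiles = new Map();
for (const row of attachmentRows) {
  const fileName = resolvePlaceholders("${Attachment.CATENTRY_ID}.txt", { Attachment: row });
  const previous = tempFiles.get(fileName) || "";
  // Text from every attachment of a catalog entry is appended to the same file.
  tempFiles.set(fileName, previous + row.EXTRACTED_TEXT + "\n");
}

console.log(Array.from(tempFiles.keys()));
```

The sketch loops over the rows, resolves the placeholder for each one, and appends text from both attachments of catalog entry 10001 to a single 10001.txt entry.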

The buildindex RESTful call is used to index the unstructured content.

Structured content data-import handler process updates to index unstructured content

Structured objects might need to be searched by the content of their related unstructured content. Therefore, the default structured content DIH indexing process must also read the unstructured content. The PlainTextEntityProcessor reads the temporary files that the TikaEntityProcessor creates during the unstructured content DIH process, and indexes their content in the defined field.

The basePath of the data source in the DIH configuration file defines the temporary folder location. In this configuration, onError="continue" is set so that if the file does not exist, or if other errors occur, the DIH process ignores the error and continues running. The column name of the unstructured field is a fixed value, plainText, and must not be changed.
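The effect of onError="continue" can be pictured with a small sketch: a missing temporary file leaves the plainText field unset, and the remaining catalog entries are still processed. This is not the PlainTextEntityProcessor code. The attachPlainText function, the readFromMemory reader, and the sample data are assumptions for illustration.

```javascript
// Adds a plainText field to each catalog entry document when its temp file can be read.
function attachPlainText(catalogEntries, readTempFile) {
  return catalogEntries.map((entry) => {
    const doc = Object.assign({}, entry);
    try {
      doc.plainText = readTempFile(entry.catentryId + ".txt");
    } catch (error) {
      // Mirrors onError="continue": skip the field and keep indexing.
    }
    return doc;
  });
}

// Hypothetical in-memory stand-in for the temporary folder.
const storedFiles = { "10001.txt": "Extracted user manual text" };
function readFromMemory(name) {
  if (!(name in storedFiles)) {
    throw new Error("File not found: " + name);
  }
  return storedFiles[name];
}

const indexedDocs = attachPlainText(
  [{ catentryId: 10001 }, { catentryId: 10002 }],
  readFromMemory
);
console.log(indexedDocs);
```

In this sketch, catalog entry 10001 receives a plainText value, while 10002 is still returned but without that field because its temporary file does not exist.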