Introduced in Feature Pack 2

The indexing process

The indexing process involves adding Documents to an IndexWriter. The searching process involves retrieving Documents from an index using an IndexSearcher. Solr can index both structured and unstructured content.

Structured content is well organized. For example, a product description has predefined fields such as title, manufacturer name, description, and color.

Unstructured content, in contrast, lacks structure and organization. For example, it can consist of PDF files or content from external sources (such as tweets) that do not follow any predefined patterns.

Data Import Handler

The data import handler can perform full imports or delta imports. When the full-import command is run, the start time of the operation is stored in the dataimport.properties file, located in the same directory as the solrconfig.xml file. The handler is declared in solrconfig.xml. For example:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
   <lst name="defaults">
     <str name="config">wc-data-config.xml</str>
     <str name="update.chain">wc-conditionalCopyFieldChain</str>
   </lst>
</requestHandler>
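The wc-conditionalCopyFieldChain named by update.chain is defined elsewhere in solrconfig.xml. As a rough sketch of what such a chain declaration looks like (the processor classes and field names below are illustrative assumptions, not the chain that WebSphere Commerce ships):

```xml
<!-- Illustrative update request processor chain; the actual
     wc-conditionalCopyFieldChain differs from this sketch -->
<updateRequestProcessorChain name="wc-conditionalCopyFieldChain">
  <!-- Copy one field into another before the document is indexed -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">shortDescription</str>
    <str name="dest">defaultSearch</str>
  </processor>
  <!-- Finish the chain by running the standard update -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```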

Fetching, reading, and processing data

The wc-data-config.xml file defines the following behaviors:
  • How to fetch data, such as using queries or URLs.
  • What to read, such as resultset columns or XML fields.
  • How to process, such as modifying, adding, or removing fields.
For example, the solrhome\MC_10001\en_US\CatalogEntry\conf\wc-data-config.xml file contains the following content:


  <dataSource name="WC database" ... />

  <dataSource name="unstructuretmpfile" ... />
Where two data sources are used:
  • The WebSphere Commerce database is the data source for structured data.
  • The unstructuretmpfile specifies the path to the unstructured data.
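A more complete pair of declarations might look like the following sketch; the type, driver, and connection values are placeholder assumptions, not values from the shipped configuration:

```xml
<!-- JDBC source for structured data from the WebSphere Commerce
     database; driver, url, and credentials are placeholders -->
<dataSource name="WC database" type="JdbcDataSource"
            driver="com.ibm.db2.jcc.DB2Driver"
            url="jdbc:db2://localhost:50000/mall"
            user="wcsadmin" password="passw0rd"/>

<!-- Binary file source for crawled, unstructured content -->
<dataSource name="unstructuretmpfile" type="BinFileDataSource"/>
```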

In addition, the file contains the following types of content by default:

The following three documents exist: one for CatalogEntry, one for bundle, and one for dynamic kit.

The CatalogEntry document contains the following entities: Product and attachment_content.

The Product entity contains the following parameters:
  • query, which identifies the data that populates fields of the Solr document when running full imports.
  • deltaImportQuery, which identifies the data that populates fields when running delta imports.
  • deltaQuery, which identifies the primary keys of the current entity that have changed since the last index time.
  • deletedPkQuery, which identifies the documents that should be removed from the index.
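These four behaviors correspond to the standard Data Import Handler entity attributes query, deltaImportQuery, deltaQuery, and deletedPkQuery. A minimal sketch follows; the table and column names are assumed for illustration, not the temporary tables that di-preprocess actually creates:

```xml
<!-- Sketch of a DIH entity with full- and delta-import queries -->
<entity name="Product" dataSource="WC database"
        query="SELECT * FROM TI_CATENTRY_0"
        deltaImportQuery="SELECT * FROM TI_CATENTRY_0
                          WHERE CATENTRY_ID='${dataimporter.delta.CATENTRY_ID}'"
        deltaQuery="SELECT CATENTRY_ID FROM TI_DELTA_CATENTRY_0
                    WHERE LAST_UPDATE &gt; '${dataimporter.last_index_time}'"
        deletedPkQuery="SELECT CATENTRY_ID FROM TI_DELTA_CATENTRY_0
                        WHERE ACTION='D'"/>
```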
Every set of fields that are fetched by the entity can be consumed either directly by the indexing process, or massaged using transformers to modify a field or create a new set of fields.
Transformers are chained and applied sequentially in the order in which they are specified. After the fields are fetched from the data source, the entity columns are processed one at a time in the order listed inside the entity tag. Each column is scanned by the first transformer to see whether any of the transformer's attributes are present; if so, the transformer is run.
When all of the listed entity columns have been scanned, the process is repeated using the next transformer in the list. A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. In the preceding example, the following transformers are used:
  • RegexTransformer, which extracts or manipulates values from source fields by using regular expressions.
  • ClobTransformer, which creates a String out of a Clob type in the database.
  • A custom transformer, which dynamically creates new fields that are then used as attributes.
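A sketch of how such a transformer chain is declared on an entity; the column and source column names below are illustrative assumptions:

```xml
<!-- Transformers listed in the transformer attribute run in order -->
<entity name="Product" dataSource="WC database" query="..."
        transformer="RegexTransformer,ClobTransformer">
  <!-- RegexTransformer: split a delimited column into multiple values -->
  <field column="catgroup" splitBy=";" sourceColName="CATGROUP_IDS"/>
  <!-- ClobTransformer: convert a CLOB column into a plain String -->
  <field column="longDescription" clob="true" sourceColName="LONGDESCRIPTION"/>
</entity>
```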
The wc-data-config.xml file also contains column-to-field mappings that specify the relationship between index field names and database column names. For example:

<field column="CATENTRY_ID" name="catentry_id" />
<field column="MEMBER_ID" name="member_id" />
<field column="CATENTTYPE_ID" name="catenttype_id_ntk_cs" />
<field column="PARTNUMBER" name="partNumber_ntk" />

Where: CATENTRY_ID is the database column name and catentry_id is the index field name.
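On the index side, each mapped name must exist as a field in the schema. The declarations below are a sketch; the field types and flags are assumptions, not the shipped WebSphere Commerce schema:

```xml
<!-- Possible schema.xml counterparts for the mappings above -->
<field name="catentry_id"    type="string" indexed="true" stored="true"/>
<field name="member_id"      type="string" indexed="true" stored="true"/>
<field name="partNumber_ntk" type="string" indexed="true" stored="true"/>
```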

Setting up and building the search index

WebSphere Commerce uses the following utilities to set up and build the search index:
  • setupSearchIndex, which is run once per master catalog. For more information, see Setting up the search index.
  • di-preprocess, which extracts and flattens WebSphere Commerce data and then outputs the data into a set of temporary tables inside the WebSphere Commerce database. The data in the temporary tables is then used by the index building utility to populate the data into search indexes by using the Data Import Handler (DIH).
  • di-buildindex, which crawls the temporary tables that are populated by the preprocess utility and then populates the Solr index.

    For more information, see Preprocessing and building the search index.

Crawling unstructured content

For unstructured content, Solr's ExtractingRequestHandler uses Apache Tika to allow users to upload binary files and unstructured data to Solr. Solr then extracts and indexes the content.

WebSphere Commerce uses the Droid site content crawler to crawl the web and put the content into files under the unstructuretmpfile path that is specified in the wc-data-config.xml file for the CatalogEntry index. Tika then parses these files, and the information is indexed by the DIH. Unstructured data comes from two sources: the database and the crawler. The unstructured index therefore uses two data configuration files: wc-data-config.xml handles product attachments, such as PDF files, while wc-web-data-config.xml handles web content.
Note: Unstructured content must not be encrypted, so that it can be crawled and indexed correctly.
For example, the solrconfig.xml file contains the following content:

<!-- Solr Cell Update Request Handler -->
  <requestHandler name="/update/extract"
            class="solr.extraction.ExtractingRequestHandler" >
   <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
        the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
   </lst>
  </requestHandler>

Where:
  • Tika automatically determines the input document type and produces an XHTML stream that is then fitted to a SAX ContentHandler.
  • Solr then reacts to Tika's SAX events and creates the fields to index.
  • Tika produces metadata information such as Title, Subject, and Author.
  • All of the extracted text is added to the content field. Setting fmap.content to text causes the content to be added to the text field instead.
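Because uprefix is set to ignored_, any Tika-produced field that does not match the schema is renamed with that prefix. A conventional way to silently drop such fields is an ignored dynamic field in schema.xml; the declaration below follows a common Solr pattern and is assumed rather than taken from the product schema:

```xml
<!-- Catch-all for unmapped Tika metadata; never indexed or stored -->
<dynamicField name="ignored_*" type="ignored"
              indexed="false" stored="false" multiValued="true"/>
```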

For more information about unstructured content, see Unstructured and site content.

For more information about the WebSphere Commerce index schema, see WebSphere Commerce search index schema and WebSphere Commerce search index schema definition.