Indexing site content with WebSphere Commerce Search

WebSphere Commerce contains unmanaged content, such as site content, that must be crawled using the site content crawler. Unmanaged content intended for production must be published separately because it is not part of staging propagation. After the static content is copied to the correct location, you must manually re-index the site content from the production system against the repeater.

Site content crawler

The site content crawler crawls HTML and other site files from WebSphere Commerce starter stores to help populate the site content search index.

The site content crawler captures the site content, caches it in a local directory, and adds an entry for each cached file to the manifest.txt file. It then maps the physical file locations to their corresponding URLs. The indexer uses the manifest file to retrieve the physical temporary file locations, creates the indexes, and, after the content is tokenized, associates the file URLs with the index records.

The following table highlights the site content crawler workflow:
Site content crawler actions and workflow

Action: Site content crawler launches
Workflow: The site content crawler:
  1. Reads the site content crawler configuration files
  2. Reads the host filter configuration files
  3. Initializes the site crawler internal parameters

Action: Site content crawler creates directory structure
Workflow: The site content crawler:
  1. Locates the destination directory from the configuration
  2. Creates the date directory under the destination directory
  3. Creates the counter directory under the date directory
The following diagram depicts a high-level overview of the site content crawler directory structure:
[Figure: Site content crawler directory structure]

Action: Site content crawler crawls site content
Workflow: The site content crawler:
  1. Reads from the URL pool
  2. Crawls site content files
  3. Extracts URL links
  4. Filters URL links
  5. Adds URLs to the URL pool

Action: Site content crawler completes
Workflow: If the site content crawler is successful, it:
  1. Saves the site content to the current counter directory
  2. Adds an entry to the manifest.txt file
If the site content crawler fails, it:
  1. Adds an entry to the errors.txt file
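
The workflow above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names, the YYYYMMDD date directory name, the numeric counter directory name, the tab-separated manifest.txt and errors.txt line layout, and the injected fetch callable are all assumptions made for illustration.

```python
import datetime
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of anchor tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def make_counter_dir(destination, today=None):
    """Create <destination>/<date>/<counter> and return the counter path.

    The directory naming scheme is an assumption, not the documented one.
    """
    today = today or datetime.date.today()
    date_dir = os.path.join(destination, today.strftime("%Y%m%d"))
    os.makedirs(date_dir, exist_ok=True)
    counter = 1
    while os.path.exists(os.path.join(date_dir, str(counter))):
        counter += 1
    counter_dir = os.path.join(date_dir, str(counter))
    os.makedirs(counter_dir)
    return counter_dir


def crawl(seed_urls, allowed_hosts, destination, fetch, max_pages=100):
    """Crawl from seed_urls; fetch(url) returns page text or raises."""
    counter_dir = make_counter_dir(destination)
    manifest = os.path.join(counter_dir, "manifest.txt")
    errors = os.path.join(counter_dir, "errors.txt")
    pool = deque(seed_urls)   # the URL pool
    seen = set(seed_urls)     # avoids queueing the same URL twice
    processed = 0
    while pool and processed < max_pages:
        url = pool.popleft()
        processed += 1
        try:
            html = fetch(url)
        except Exception as exc:
            # Failure: record the URL in errors.txt and move on.
            with open(errors, "a", encoding="utf-8") as f:
                f.write(f"{url}\t{exc}\n")
            continue
        # Success: cache the content and map its location to the URL.
        path = os.path.join(counter_dir, f"{processed}.html")
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        with open(manifest, "a", encoding="utf-8") as f:
            f.write(f"{path}\t{url}\n")
        # Extract links, filter them by host, and add new ones to the pool.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).hostname in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                pool.append(absolute)
    return counter_dir
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP client and lets the crawl loop be exercised against in-memory pages.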

Site content crawler and indexer integration

The indexer acts as a service to the site content crawler. After each crawl completes, the site content crawler sends a request directly to the WebSphere Commerce Search server at a specific URL, and the indexing process starts asynchronously. The URL typically resembles the following sample:
  • http://localhost/solr/unstructured_core_name/webdataimport?command=full-import&basePath=path_to_directory_of_manifest_file_with_path_separator_appended
The URL is specified in the site content crawler configuration file.
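
As a rough illustration of how such a URL is put together, the following Python sketch builds the request from a host, a core name, and the directory that holds the manifest file, appending the path separator as the sample requires. The function names are hypothetical, and the host and core name must come from your own configuration; the sketch percent-encodes the basePath value, which the sample above does not show.

```python
import os
from urllib.parse import urlencode
from urllib.request import urlopen


def build_import_url(host, core_name, manifest_dir):
    """Return a webdataimport URL shaped like the sample above."""
    # basePath must end with the path separator.
    if manifest_dir.endswith(os.sep):
        base_path = manifest_dir
    else:
        base_path = manifest_dir + os.sep
    query = urlencode({"command": "full-import", "basePath": base_path})
    return f"http://{host}/solr/{core_name}/webdataimport?{query}"


def trigger_indexing(url, timeout=10):
    """Send the request and return the HTTP status code.

    Indexing itself runs asynchronously on the search server, so a
    successful response only means the request was accepted.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.status
```

For example, build_import_url("localhost", "unstructured_core_name", path) yields a URL with command=full-import and a basePath that ends in the separator, where path is the counter directory that contains manifest.txt.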