Site content crawler configuration

The behavior of the site content crawler is determined by its configuration files and manifest files.

You can start the site content crawler by accessing the following URL:
http://searchHost:port/search/admin/resources/crawler?action=start&langId=langId&storeId=storeId&catalogId=catalogId
Mandatory context parameters
langId
Internal numeric identifier that represents the language, for instance, -1 for English.
storeId
Internal numeric identifier that represents the store, for instance, 10001.
catalogId
Internal numeric identifier that represents the catalog, for instance, 10001.
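
As a rough illustration of how the start URL above is assembled from the three mandatory context parameters, here is a minimal Python sketch. The function name, the host name searchHost, and the port 3737 are placeholders chosen for this example and are not taken from the product; substitute the values for your own search server.

```python
from urllib.parse import urlencode


def build_crawler_start_url(search_host, port, lang_id, store_id, catalog_id):
    """Build the crawler start URL from the mandatory context parameters."""
    query = urlencode({
        "action": "start",
        "langId": lang_id,
        "storeId": store_id,
        "catalogId": catalog_id,
    })
    return f"http://{search_host}:{port}/search/admin/resources/crawler?{query}"


# Placeholder host and port; -1 and 10001 are the sample values listed above.
print(build_crawler_start_url("searchHost", 3737, -1, 10001, 10001))
```

Building the query string with urlencode keeps the parameter values correctly escaped if they are ever supplied from user input or a script.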

The runtime engine automatically populates two variables: hostname, which is the store server host name, and portnum, which is the store server port number.

The site content crawler relies on the following input configuration files, which are located in the Liberty/usr/servers/searchServer/resources\search\index\crawler\ext\ directory:
droidConfig.xml
The site content crawler configuration file. It contains the variables and parameters that determine the site content crawler behavior. Variables that are specified in this file are used to populate values later in the same file.
Parameters
initialLocations
The starting URL for the site content crawler.
Important: You must update the starting URL for the site content crawler to operate correctly.
For example:
https://${hostname}:${portnum}/shop/StaticContentSitemap?storeId=1&langId=-1&catalogId=10502
relativePath
If specified, the relative path is omitted from the URLs added into the manifest file. For example:

4,StaticContent/Recipe.html,8fa661c4-f812-4b3c-aa5c-361894120d23.html,text/html,UTF-8,A,3 
If not specified, then an absolute path is set in the URLs. For example:

4,http://wcsolr05/webapp/wcs/stores/servlet/StaticContent/Recipe.html,5b770798-cd9a-478d-9fb3-b75c1e1c3b91.html,text/html,UTF-8,A,6 
It is important to set the relative path so that URLs in a production environment point to the production server itself rather than to the staging server.
depth
The maximum depth the crawler crawls. A value of -1 denotes no depth restrictions.
max
The maximum number of pages to crawl. A value of -1 denotes no maximum.
delay
The delay time in milliseconds between each HTTP request.
filters
The host filter configuration file location.
threadmode
The site content crawler thread mode.
0
Single thread mode
1
Multiple thread mode
maxthread
The number of threads to create when in the multiple thread mode.
autoIndex
Indicates whether to enable automatic indexing of site content after the content is crawled.
skipDownload
Specifies URLs that are not added to the manifest.txt file and are therefore not indexed. For example, to skip StaticContentSitemap.jsp:

http://${hostname}/webapp/wcs/stores/servlet/StaticContentSitemap?storeId=${storeId}&langId=${langId}&catalogId=${catalogId}
jndiName
The JNDI name of the JDBC data source, for example, <jndiName>jdbc/jndiName</jndiName>. It is used only when you run the crawler through the URL. When this parameter is specified, the crawler can use that data source to update the database after the crawl finishes.
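
The effect of the ${hostname} and ${portnum} variables on a value such as initialLocations can be sketched in Python as follows. This only illustrates the substitution result; how the crawler runtime performs the substitution internally is not described here, and the host store.example.com, the port 443, and the helper name resolve are assumptions made for the example.

```python
from string import Template

# The initialLocations example from above, with its two runtime variables.
initial_location = Template(
    "https://${hostname}:${portnum}/shop/StaticContentSitemap"
    "?storeId=1&langId=-1&catalogId=10502"
)


def resolve(template, hostname, portnum):
    """Replace ${hostname} and ${portnum} with concrete values."""
    return template.substitute(hostname=hostname, portnum=portnum)


# Hypothetical store server host and port.
print(resolve(initial_location, "store.example.com", 443))
```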
filters.txt
The filters configuration file determines whether URLs are included or ignored by the site content crawler.
You can update the filters configuration file by using regular expressions to include or ignore values.
Important: You must update the filters configuration file to include your HCL Commerce host name.
The default sample values ignore, for example, URLs that contain email or FTP links and pages that require logging on to the site.
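
The include and ignore logic can be pictured with the following Python sketch. It does not reproduce the actual filters.txt syntax, which is not shown in this topic; the patterns, the host store.example.com, the LogonForm pattern, and the rule that ignore patterns take precedence are all assumptions made for illustration.

```python
import re

# Hypothetical ignore patterns: email and FTP links, and a logon page.
IGNORE_PATTERNS = [
    re.compile(r"^(mailto|ftp):", re.IGNORECASE),
    re.compile(r"LogonForm", re.IGNORECASE),
]

# Hypothetical include pattern: only URLs on the store host are crawled.
INCLUDE_PATTERNS = [
    re.compile(r"^https?://store\.example\.com(:\d+)?/"),
]


def should_crawl(url):
    """Return True if the URL matches an include pattern and no ignore pattern."""
    if any(pattern.search(url) for pattern in IGNORE_PATTERNS):
        return False
    return any(pattern.search(url) for pattern in INCLUDE_PATTERNS)


for candidate in (
    "http://store.example.com/webapp/wcs/stores/servlet/StaticContent/Recipe.html",
    "mailto:help@example.com",
    "http://other.example.org/page.html",
):
    print(candidate, should_crawl(candidate))
```

Replacing store.example.com with your own HCL Commerce host name in the include pattern mirrors the update that the Important note above requires.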
SiteMap.jsp
The site map, which is used by web browsers and external search engines, contains pointers to the different starter store pages.
StaticContentSitemap.jsp
The static site map contains pointers to the static content files that are in the HCL Commerce database.
The URL that is passed from the configuration file to the site content crawler is:
http://host_name/webapp/wcs/stores/servlet/StaticContentSitemap?storeId=storeId&langId=-1&catalogId=catalogId
You must update the static site map file to include your additional static content files that are in the HCL Commerce database.

This file is used only by the site content crawler.

Site content crawler manifest files
The site content crawler manifest.txt output files are comma-separated values (CSV) documents that contain generated information. You can find the files in the searchServerPath\resources\search\index\crawler\cache\date\number directory, where:
date
Is the date when the crawler utility was run.
number
Is the number of times the crawler was run, starting with 1.
  1. The manifest file that indicates which folder contains the downloaded site content files. It contains the following columns:
    Timestamp
    The time stamp for the column.
    Directory path
    The counter directory path.
    Initial location URLs
    The initial URLs separated by a comma.
  2. The manifest file that contains the mappings of downloaded files to URLs. It contains the following columns:
    ID
    The ID that distinguishes each file in the document. For example, a simple sequence.
    URL
    The relative URL to the current store, or full URL pointing to external resources.
    Local file path
    The file path, either in full format or relative format, of the stored site content.
    Content-type
    The content type of the file, for example, text/html.
    Encoding
    The encoding of the file, if it is a text-based file.
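
To show how the second manifest file might be consumed, here is a minimal Python sketch that parses one of the sample rows shown earlier into the five documented columns. The class and function names are hypothetical. The sample row carries two trailing values (A and 3) that this topic does not describe, so the sketch keeps them in a generic extra field rather than guessing their meaning.

```python
import csv
import io
from typing import NamedTuple


class ManifestEntry(NamedTuple):
    file_id: str
    url: str
    local_path: str
    content_type: str
    encoding: str
    extra: tuple  # trailing columns that are not described in this topic


def parse_manifest(text):
    """Parse manifest CSV text into a list of ManifestEntry records."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5:
            continue  # skip blank or incomplete lines
        fields = [value.strip() for value in row]
        entries.append(ManifestEntry(*fields[:5], extra=tuple(fields[5:])))
    return entries


# Sample row taken from the relativePath example above.
sample = (
    "4,StaticContent/Recipe.html,"
    "8fa661c4-f812-4b3c-aa5c-361894120d23.html,text/html,UTF-8,A,3 \n"
)
for entry in parse_manifest(sample):
    print(entry.url, "->", entry.local_path)
```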