Parallel preprocessing and distributed indexing configuration options

Different options can be used to set up parallel preprocessing and distributed indexing. The appropriate configuration depends on the environment: for example, whether a single JVM or multiple JVMs are used, the memory that is allocated to each JVM, hard disk size and speed, network latency to remote or local search servers or databases, and the database configuration.
When you evaluate configuration options, you must identify the bottleneck of the overall indexing process. Consider the following factors:
  • Large amounts of data: Make as much disk space available as possible. The heap size and memory buffer size must also be adjusted, especially with a single JVM.
  • Disk I/O: A high number of shards results in more disk I/O operations. Therefore, consider a disk type with fast write speeds.
  • Database configurations.
  • Network latency.

Parallel preprocessing configurations

Preprocessing performance is typically affected by the database configuration, network latency, and the heap size that is allocated to the JVM.
  • Consider database tuning options, such as the transaction cache, logger cache, and table spaces.
  • Tune the JVM heap size to set the minimum and maximum memory usage of the JVM that is used for preprocessing. For example, -Xms<memory> -Xmx<memory>.
  • Run each preprocessing job in a separate JVM (command-line session), as shown in the sketch after this list.
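
The following sketch shows two preprocessing jobs that are started in separate command-line sessions, each in its own JVM with its own heap settings. The utility name, configuration paths, and heap values are placeholders, and it is assumed that the utility reads its JVM options from the JAVA_OPTS environment variable; substitute the values that apply to your environment.

    # Session 1: placeholder utility name, configuration path, and heap values
    JAVA_OPTS="-Xms1024m -Xmx2048m" ./<preprocess_utility>.sh <preprocess_config_directory_1>

    # Session 2: started separately so that the job runs in its own JVM
    JAVA_OPTS="-Xms1024m -Xmx2048m" ./<preprocess_utility>.sh <preprocess_config_directory_2>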

Distributed indexing configurations

The indexing configurations cover variations of single and multiple JVMs. Indexing performance is typically affected by the JVM heap size, disk space and speed, and network latency when a remote search server is used.

The following configurations can be used for distributed indexing:

Configuration 1: Single JVM

All shard index cores and the master index core are managed by a single Solr JVM, so all of the cores share the JVM memory resources. It is important to calculate and allocate each core's maximum memory buffer size according to the JVM maximum heap size.

The following steps help you calculate and allocate each shard's memory resources:
  1. Determine the JVM heap size, and set the value to at least 2 GB.
  2. Determine the number of shards, add one to account for the master index core, and divide the overall JVM heap size by that total. Remember this value.
  3. Set the ramBufferSizeMB property in the solrconfig.xml file to the value that you obtained in the previous step (see the example after these steps).
  4. Disable any caches that are defined in the solrconfig.xml file for each of the shards, if enabled. For example, filterCache, queryResultCache, and documentCache.
  5. Optional: Disable any WebSphere Commerce components that are defined in the solrconfig.xml file. For example, comment out the following section:
    <arr name="components">
        <str>wc_query</str>
        <str>wc_facet</str>
        <str>mlt</str>
        <str>stats</str>
        <str>debug</str>
        <str>wc_spellcheck</str>
    </arr>
    Instead, use native Solr queries and facet components.
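
For example, as a hypothetical illustration of steps 2 through 4: with a 6 GB JVM heap and five shards, the heap is divided by six (five shards plus the master index core), which gives approximately 1 GB per core. Each core's solrconfig.xml file would then contain entries similar to the following; the values and cache definitions shown here are examples only, and the actual values depend on your environment:

    <!-- Example only: 6144 MB heap / (5 shards + 1 master index core) = 1024 MB per core -->
    <ramBufferSizeMB>1024</ramBufferSizeMB>

    <!-- Caches disabled during the index build (step 4)
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
    -->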

A variation of this configuration is to use different disks for each shard, or for some of the shards, for example, when disk I/O creates a bottleneck or other disk constraints exist. In that case, mount each shard's directory under the Solr home on a different disk.
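
For example, on Linux, one hypothetical way to achieve this layout is to place each shard's data directory on its own disk and link it back under the Solr home. The disk and shard paths here are placeholders:

    # Placeholder paths: each shard's data directory resides on a separate disk
    mkdir -p /disk1/solr/shard01/data /disk2/solr/shard02/data
    ln -s /disk1/solr/shard01/data <solr_home>/shard01/data
    ln -s /disk2/solr/shard02/data <solr_home>/shard02/data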

When you merge the index, the master JVM must have access to the master index core and to all of the shards' index core directories.
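
For illustration, the underlying Solr operation for such a merge is the CoreAdmin mergeindexes command, which reads each shard's index directory from disk and merges it into the target core. The host, port, core name, and directory paths in this sketch are placeholders, and the index build utilities in your environment might issue this call on your behalf:

    http://<master_host>:<port>/solr/admin/cores?action=mergeindexes
        &core=<master_core_name>
        &indexDir=<solr_home>/shard01/data/index
        &indexDir=<solr_home>/shard02/data/index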

Configuration 2: Multiple JVMs

Each of the shard index cores and the master index core is managed by its own dedicated Solr JVM. If some of the JVMs share disk and memory resources, the available memory must be distributed across all of the shards.
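
For example, as a rough illustration only, if three shard JVMs and the master JVM share a machine with 20 GB of memory, each JVM might be started with about 4 GB of heap, leaving the remaining memory for the operating system and file system cache. The values are placeholders and depend on your workload:

    # Placeholder heap settings for each of the four JVMs that share one machine
    -Xms4096m -Xmx4096m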

A variation of this configuration is to use a dedicated disk and memory for each of the JVMs.

When you merge the index, the master JVM must have access to the master index core and to all of the shards' index core directories.