Build index performance and capacity best practices for high-volume customers

A high-volume HCL Commerce implementation can involve a large catalog, a large number of languages, and a limited amount of time in which to build the search index. By doing careful capacity planning and following performance best practices, you can optimize performance and capacity for HCL Commerce version 9.0.1.3 and higher.

Capacity planning best practices

For high-volume users of HCL Commerce who want to build the search index in a shorter time frame, enough hardware resources are essential. The resources to consider include, but are not limited to, CPU, memory, storage (disk input/output), and network bandwidth.
Important: The planning procedure and sample numbers that are described in this topic are for your reference. You need to tailor the procedure for your specific environment, because HCL Commerce performance depends on the specific workload, data structure, and tuning options that you implement. Run performance tests to validate the implementation in your environment.

CPU resources

Building the search index is a CPU-intensive procedure. To build an index with many languages in a short amount of time, you need parallel processing and sharding, both of which the HCL Commerce search index build supports. For more information, see https://help.hcltechsw.com/commerce/9.0.0/search/concepts/csdsearchparallel.html.

CPU resources are the first factor to consider when you plan to optimize performance. Apply the following best practices.

  1. Determine the shard number

    In general, different languages are processed in parallel. With enough hardware resources, the total index build for multiple languages takes approximately as long as building one language. When sharding is used, the whole build index procedure is split into three stages: preprocessing, indexing, and merging. The combined duration of preprocessing and merging roughly equals the duration of indexing, and indexing speed is typically 0.5 million to 1.5 million documents per hour per shard. This example uses the conservative estimate of 0.5 million documents per hour per shard to calculate the total build index duration.

    For many HCL Commerce implementations, there are significantly more catalog entry index documents than catalog group index documents. This example considers only the catalog entry index build because the parallel index build supports only the catalog entry core.

    In this example, the build index catalog contains 6 million catalog entries (including products, SKUs, bundles, and other entries), and the target time frame for the build to complete is 3 hours. In this scenario, the necessary shard number is 6/3/0.5, which equals four shards. To shorten the build time to 1.5 hours, you must increase the shard number to 8 (6/1.5/0.5=8).

  2. Determine the number of working threads and the virtual CPU (VCPU) number
    In this example, there are 6 million catalog entries, a 3-hour time frame in which to build the index, and 10 languages to build. In this scenario, you need 40 working threads (4 shards multiplied by 10 languages) to do the parallel processing for indexing, which is the most CPU-intensive part of the whole procedure. (A sizing sketch that reproduces this arithmetic follows this list.)
    Note: Preprocessing requires more working threads, and merging requires fewer.

    You also need to allocate 1-2 VCPUs for each working thread. For more information, see https://help.hcltechsw.com/commerce/9.0.0/admin/concepts/cpmdockertune.html. Allocating two VCPUs per working thread results in faster processing than allocating one.

    Indexing is done in the Search server docker, so in this example you allocate 40-80 VCPU in total for the Search server dockers. If there are eight Search server dockers, each docker is allocated 5-10 VCPU.

    Then, you can determine the CPU resource allocation for the Database server and the Utility docker, which is where preprocessing runs. The best practice is to allocate 50-100 percent of the Search server dockers' VCPU to the Database server (40-80 VCPU in this scenario), and 25-50 percent of the Search server dockers' VCPU to the Utility docker (20-40 VCPU in this scenario).

  3. Consider batching the index build

    If you have limited hardware resources and a more relaxed time frame, or if the total number of languages is too large for all of them to be processed in parallel, consider batching the index build.

    In the sample scenario (6 million catalog entries and 10 languages), suppose the build index time frame is 6 hours instead of 3 hours. In this case, you can split the whole index build procedure into two batches, each containing five languages. Splitting the procedure into these batches halves the CPU resources that are required. To batch the build, split the parallel preprocessing properties file into two separate files (each containing five languages) and write a script that runs the build index procedure once for each properties file, one after the other (see the sketch after this list).
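
The sizing arithmetic in steps 1 and 2 and the batching approach in step 3 can be captured in a small script. The following sketch only reproduces the example numbers from this topic; the build-index script and the properties file names are hypothetical placeholders, not actual HCL Commerce utilities, so substitute the scripts and files that are used in your own environment.

  #!/bin/sh
  # Sizing sketch for the example scenario: 6 million catalog entries,
  # 10 languages, a 3-hour window, and a conservative indexing speed of
  # 0.5 million documents per hour per shard.
  ENTRIES=6000000
  WINDOW_HOURS=3
  RATE=500000          # documents per hour per shard (conservative)
  LANGUAGES=10

  # Step 1: shards = entries / window / rate, rounded up (6/3/0.5 = 4).
  SHARDS=$(awk -v e="$ENTRIES" -v h="$WINDOW_HOURS" -v r="$RATE" \
    'BEGIN { s = e / h / r; c = int(s); if (c < s) c++; print c }')

  # Step 2: one indexing thread per shard per language, 1-2 VCPU per thread.
  THREADS=$((SHARDS * LANGUAGES))
  SEARCH_MAX=$((THREADS * 2))
  echo "Shards: $SHARDS, indexing working threads: $THREADS"
  echo "Search server dockers:     $THREADS-$SEARCH_MAX VCPU"
  echo "Database server (50-100%): $((SEARCH_MAX / 2))-$SEARCH_MAX VCPU"
  echo "Utility docker (25-50%):   $((SEARCH_MAX / 4))-$((SEARCH_MAX / 2)) VCPU"

  # Step 3: batching. Run the build once per split properties file,
  # sequentially. build-index.sh and both file names are placeholders.
  for PROPS in di-parallel-process-batch1.properties \
               di-parallel-process-batch2.properties; do
    ./build-index.sh "$PROPS" || exit 1
  done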

Storage resources

The input data that is used to build the index is stored on the Database server as a data file, while the output of the index build is stored on the Search server as an index file. Both the input and output files are stored in the physical file system, and when there is a large amount of data to build, the disk I/O is intensive. This I/O access during the index build is random access, so it is a best practice to use solid-state drive (SSD) storage instead of hard disk drive (HDD) storage for the Database and Search servers. SSD storage provides higher I/O operations per second (IOPS).

In addition to IOPS, total read and write bandwidth (MB per second) is also important. For the Database server, most of the disk I/O happens during preprocessing. For the Search server, most of the disk I/O happens during merging. Each procedure must read and write a large amount of data in a short amount of time.

The actual IOPS and read/write bandwidth are determined by the average data size of each SKU (including identifier length, attribute and value name length, description length, and so on). This example scenario provides numbers for reference only; base the calculation for your implementation on your own values.

For the HCL Commerce sample store, the average index file size for each catalog entry is 5 KB before merging and 3 KB after merging. For the sample customer with 6 million catalog entries, 10 languages, and a 3-hour time frame, the merging procedure accounts for approximately one-quarter of the whole procedure, so it takes approximately 45 minutes (2700 seconds). In that time, the merging procedure must read 300 GB (5 KB multiplied by 6 million catalog entries multiplied by 10 languages) from disk and write 180 GB (3 KB multiplied by 6 million multiplied by 10 languages) to disk. The disk I/O is not evenly distributed during the merge procedure, so the allocated I/O bandwidth must be at least two times the predicted number.
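
A minimal sketch of this bandwidth estimate, using the example numbers from this topic (substitute your own document counts and per-entry index sizes):

  #!/bin/sh
  # Merge-phase bandwidth estimate: total bytes moved divided by merge time,
  # then doubled because the I/O is not evenly distributed.
  DOCS=6000000; LANGS=10
  PRE_KB=5; POST_KB=3        # index size per entry before/after merging
  MERGE_SECONDS=2700         # ~45 minutes

  awk -v d="$DOCS" -v l="$LANGS" -v pre="$PRE_KB" -v post="$POST_KB" \
      -v s="$MERGE_SECONDS" 'BEGIN {
    read_gb  = d * l * pre  / 1e6;   # ~300 GB read
    write_gb = d * l * post / 1e6;   # ~180 GB written
    printf "read:  %.0f GB, %.0f MB/s average, plan for %.0f MB/s\n",
           read_gb,  read_gb  * 1000 / s, 2 * read_gb  * 1000 / s
    printf "write: %.0f GB, %.0f MB/s average, plan for %.0f MB/s\n",
           write_gb, write_gb * 1000 / s, 2 * write_gb * 1000 / s
  }'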

Additionally, the I/O bandwidth that is allocated for the Database server should be several times that of the Search server, because preprocessing generates a large number of temporary tables and transaction logs.
Important: The actual number is affected by performance tuning.

Memory resources

Even high-speed SSD storage is much slower than in-memory operation. Therefore, for the Database and Search servers, the best practice for memory resources is that more is better. You achieve the best performance when the Database server has enough memory to hold the whole database file and the Search server has enough memory to hold the whole index file. In general, less memory causes more disk I/O operations and slows down the whole build index procedure.

Network resources

During the build index procedure, the large amount of data that is read and written to disk is also transferred through the network (between the Database server and the Utility docker, and between the Database server and the Search server).

For a high-volume customer, Gigabit Ethernet is required for the network connection between the Database server and the Utility docker, and between the Database server and the Search server. For more demanding requirements, it is a best practice to use multiple connections or 10 Gigabit Ethernet to prevent bottlenecks during network transfers.

Performance tuning

Always use the latest version of HCL Commerce to benefit from the most recent performance enhancements in the product code.

Additionally, follow these general best practices to achieve optimal performance for a high-volume index build.

Important: These best practice recommendations are based on lab environment test results. Therefore, your actual tuning must be based on performance testing in your environment.
  • Preprocessing with the Utility docker
    • Enable multi-thread preprocessing by setting Global.preprocessing-multithread=true in di-parallel-process.properties.
    • Enable multi-thread preprocessing for all languages by setting Global.preprocessing-locale=en_US,ja_JP,de_DE (a comma-separated list of locales); see the sample excerpt after this list.
      Important: If this parameter is set as Global.preprocessing-locale=all, then the languages are processed sequentially.
    • Tailor the preprocessing configuration XML files to control the number of parallel working threads. For more information, see the Controlling the number of preprocessing working threads section.
    • Use CopyColumnsDataPreProcessor for most tables. CopyColumnsDataPreProcessor is the default for HCL Commerce version 9.0.1.3 and later.
    • For Db2®, disable the transaction log for CopyColumnsDataPreProcessor.
    • Increase the fetchSize and batchSize numbers for preprocessors other than CopyColumnsDataPreProcessor.
  • Indexing with the Search server docker
    • Increase batchSize (solr.dih.batchSize), which is defined as a global JVM option or as a core-level parameter in the SRCHCONFEXT table in the database.
  • Database server
    • For Db2®, enable intra-partition parallelism by setting INTRA_PARALLEL=YES in the database manager configuration (dbm cfg); see the sample commands after this list.
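
As an illustration, the preprocessing settings above map to an excerpt like the following in di-parallel-process.properties (the locale list is an example only; use the locales of your stores):

  # Enable multi-thread preprocessing, with locales listed explicitly
  # (comma separated, not "all", so that languages run in parallel).
  Global.preprocessing-multithread=true
  Global.preprocessing-locale=en_US,ja_JP,de_DE

For Db2, intra-partition parallelism is enabled with the standard database manager configuration command; this sketch assumes that an instance restart is acceptable in your environment, because the dbm cfg change takes effect only after a restart:

  db2 update dbm cfg using INTRA_PARALLEL YES
  db2stop
  db2start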

Controlling the number of preprocessing working threads

According to lab environment test results, the Database server does not scale linearly for parallel processing during preprocessing. Beyond the saturation point, increasing the number of parallel working threads can degrade performance. When you enable intra-partition parallelism in Db2®, there are multiple working threads in the Db2® server for each preprocessing working thread. Therefore, it is important to control the number of preprocessing working threads to achieve optimal build index performance.

When multi-thread preprocessing is disabled, there is one preprocessing working thread per shard and per language. When multi-thread preprocessing is enabled, there are multiple preprocessing working threads per shard and per language: the HCL Commerce product code creates one working thread for each preprocessing configuration XML file (such as wc-dataimport-preprocess-attribute.xml). The sample preprocessing XML files are organized by table relationship rather than by performance considerations, so you can split or merge the XML files based on your own performance requirements.

Starting with version 9.0.1.2, preprocessing is split into two phases. The first phase processes language-independent tables and runs only once. The second phase processes language-dependent tables, with all languages processed in parallel.

For the first phase, the total number of working threads equals the number of XML files multiplied by the shard number. For the second phase, the total number of working threads equals the number of XML files multiplied by the shard number multiplied by the language number. When many languages are processed in parallel, the working thread numbers of the two phases can differ significantly.

In general, you can split the language-independent table XML files and merge the language-dependent table XML files to balance the working thread numbers of the two phases. For example, if all language-dependent table XML files are merged into one XML file, the total number of working threads in the second phase equals the shard number multiplied by the language number, as the sketch below illustrates.
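
The following is a minimal sketch of that thread arithmetic. The XML file counts are illustrative example values only; count the files in your own preprocessing configuration:

  #!/bin/sh
  # Working-thread counts for the two preprocessing phases described above,
  # using the example scenario of 4 shards and 10 languages.
  XML_PHASE1=8     # language-independent XML files (example value)
  XML_PHASE2=1     # language-dependent XML files, merged into one
  SHARDS=4
  LANGUAGES=10

  echo "Phase 1 threads: $((XML_PHASE1 * SHARDS))"               # 8 x 4 = 32
  echo "Phase 2 threads: $((XML_PHASE2 * SHARDS * LANGUAGES))"   # 1 x 4 x 10 = 40

With these example values, the two phases run 32 and 40 threads respectively, which is roughly balanced.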