Data Load parallelization

The data load utility is improved in HCL Commerce Version 9.1 to support parallelization. Parallelization lets certain data load jobs complete much faster by increasing the number of threads that are used to load data into the database.

Important:

Performance tuning of the Data Load utility and your data is required to use this feature effectively. Without tuning, this feature can reduce overall Data Load performance. You must carefully consider the relationship between parallelization, its configuration, the type of data, and the structure of that data.
Data Load parallelization is only compatible with CSV formatted data.

In previous versions of HCL Commerce, the data load utility was a single-threaded application, constrained by singleton classes that were not designed for parallel use. This design limited the performance of the utility on large data jobs, and long-running jobs could impair users' ability to get work done. With this upgrade to the data load utility, multiple users of the tool can load data concurrently. In addition, shorter jobs allow subsequent jobs to start sooner.

Architecture

The architectural enhancements made to the data load utility add a queue between the reader and the writers. The reader thread groups lines from the CSV file into batches and places the batches in the queue. The queue has a maximum size. When the queue is full, the reader pauses, and it resumes adding batches as writer threads consume them. After all of the data is read from the input file, the reader thread places an empty batch into the queue, and then exits with a data load summary report.

Each writer thread removes one batch at a time from the queue and processes the batch to load its data into the database. When a writer thread gets the empty batch from the queue, it places the empty batch back into the queue, so that the remaining writer threads also receive it, and then exits with a data load summary report.
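The reader and writer behavior described above follows a bounded producer-consumer pattern with an empty batch as the end-of-input marker. The following is a minimal Python sketch of that pattern, not the utility's actual implementation. The function names, the in-memory `loaded` list that stands in for the database, and the choice to run the reader on the calling thread are assumptions made for illustration.

```python
import queue
import threading

EMPTY_BATCH = []  # an empty batch marks the end of the input


def reader(lines, batches, batch_size):
    """Group input lines into batches; put() blocks while the queue is full."""
    for start in range(0, len(lines), batch_size):
        batches.put(lines[start:start + batch_size])
    batches.put(EMPTY_BATCH)  # signal that no more data will arrive


def writer(batches, loaded, lock):
    """Consume batches until the empty batch appears, then re-queue it and exit."""
    while True:
        batch = batches.get()
        if not batch:
            batches.put(batch)  # pass the marker on to the other writers
            return
        with lock:
            loaded.extend(batch)  # stand-in for writing rows to the database


def run_load(lines, number_of_threads=4, input_data_list_size=20, queue_size=None):
    """Run one reader and several writers over a bounded queue (illustrative only)."""
    batches = queue.Queue(maxsize=queue_size or number_of_threads)
    loaded, lock = [], threading.Lock()
    writers = [
        threading.Thread(target=writer, args=(batches, loaded, lock))
        for _ in range(number_of_threads)
    ]
    for thread in writers:
        thread.start()
    reader(lines, batches, input_data_list_size)  # reader runs on the calling thread here
    for thread in writers:
        thread.join()
    return loaded
```

Because writers finish batches in no guaranteed order, the rows in `loaded` can arrive in a different order than they appear in the input, which is one reason the structure of the input data matters when parallelization is enabled.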

Until all writer threads have finished and exited, the Data Load utility checks each writer thread for errors. If the writer threads created error reprocess CSV files, the Data Load utility merges them into a single error reprocess CSV file, and then reloads this CSV file using a single writer thread. Once all writer threads finish, the Data Load utility produces a combined data load summary report.
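The merge step can be pictured as a simple concatenation of the per-thread error files. The sketch below is an illustration only: the function name is hypothetical, and it assumes each error file starts with exactly one header row, which may not match the layout of the files that the utility actually produces.

```python
import csv


def merge_error_files(error_files, merged_path):
    """Concatenate per-writer error CSVs into one file, keeping a single header.

    Assumption: every input file begins with one header row.
    Returns the number of data rows written.
    """
    header_written = False
    rows_written = 0
    with open(merged_path, "w", newline="") as out:
        out_csv = csv.writer(out)
        for path in error_files:
            with open(path, newline="") as source:
                in_csv = csv.reader(source)
                header = next(in_csv, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    out_csv.writerow(header)
                    header_written = True
                for row in in_csv:
                    out_csv.writerow(row)
                    rows_written += 1
    return rows_written
```

The merged file would then be loaded by one writer thread, so rows that failed only because of contention between threads get a second attempt without that contention.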

Performance considerations and error handling

Due to the complex nature of loading hierarchical data with multiple threads, error handling must be carefully considered when enabling and configuring parallelization. This is especially true when it comes to performance tuning for your particular environment and dataset.

Warning: If your data contains many line items that reference the same data, SQL deadlocks can occur. Ensure that the data you are loading is free of duplicate or contradictory entries, and is structured so that multiple writer threads do not write to the same SQL entries at the same time.
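One way to check input data for duplicates before loading is to count how often each key value appears. The following is a minimal sketch, assuming the CSV text has a single header row and that one column identifies the record. The column name `PartNumber` in the usage example is only an illustration, so substitute the key column of your own data.

```python
import csv
import io
from collections import Counter


def find_duplicate_keys(csv_text, key_column):
    """Return {key: count} for every key value that appears more than once."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[key_column] for row in rows)
    return {key: count for key, count in counts.items() if count > 1}


sample = "PartNumber,Name\nA1,Shirt\nB2,Hat\nA1,Shirt v2\n"
duplicates = find_duplicate_keys(sample, "PartNumber")
```

In this sample, `A1` appears on two lines, so it is reported as a duplicate. If those lines land in different batches, two writer threads could try to update the same record.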

When implementing parallelization, consider the following as best practice:

Use the existing data load parameters commitCount, batchSize, and maxError for each LoadItem to tune the performance of the data load utility.
Format your data to leverage parallelization appropriately. For example, if your data contains hierarchical data, place parent data together towards the beginning of the file. This will reduce the chances of attempting to load child data for which parent data is not yet present.
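The parent-first ordering suggested above can be prepared before the load by sorting rows by their depth in the hierarchy. This sketch assumes each row carries its own key and an optional parent key. The column names `PartNumber` and `ParentPartNumber` are hypothetical defaults, not a required layout.

```python
def order_parents_first(rows, key="PartNumber", parent="ParentPartNumber"):
    """Return rows sorted so that every parent precedes its children.

    Rows are dictionaries. The sort is stable, so rows at the same depth
    keep their original relative order.
    """
    parent_of = {row[key]: (row.get(parent) or None) for row in rows}

    def depth(item, seen=()):
        above = parent_of.get(item)
        # Depth 0: no parent, parent not in this file, or a cycle was detected.
        if above is None or above not in parent_of or above in seen:
            return 0
        return 1 + depth(above, seen + (item,))

    return sorted(rows, key=lambda row: depth(row[key]))


rows = [
    {"PartNumber": "sku1", "ParentPartNumber": "prod1"},
    {"PartNumber": "prod1", "ParentPartNumber": ""},
    {"PartNumber": "sku2", "ParentPartNumber": "prod1"},
]
ordered = [row["PartNumber"] for row in order_parents_first(rows)]
```

Ordering the file this way does not guarantee that a parent is committed before a child is processed by another thread, but it makes that situation less likely.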

Configurable parameters and defaults

By default, the data load utility is set to run in single-thread mode. This preserves the job behavior and performance that users have come to expect. The following new parameters have been added to control the parallelization of the data load utility.

A sample data load configuration that includes the full use of these parameters is available here.


Parameter	Value type	Default value	Description
`numberOfThreads`	Integer	1	The maximum number of individual writer threads that take batches of data from the queue, process them in order, and write the processed data into the database. By default, the `numberOfThreads` parameter is set to `1`, meaning the data load utility runs in single-threaded (legacy) mode. The maximum number of threads is `8`. If a number greater than `8` is provided, the maximum number of threads is used. Based on internal performance testing, HCL recommends using `4` threads. Use of more than four threads has been shown to reduce overall load performance, and can result in errors such as the following: `com.ibm.commerce.foundation.dataload.exception.DataLoadApplicationException: A problem occurred while initializing the property information during the business object builder initialization.`
`inputDataListSize`	Integer	20	The maximum number of CSV line entries that is included in a batch of data to be added to the queue. Each writer thread handles a single batch of data from the queue. Once it is loaded, the thread is freed to process another batch from the queue. By default, the `inputDataListSize` parameter is set to `20`.
`queueSize`	Integer	`numberOfThreads`	The maximum number of batches that can exist in the queue. Once the queue is filled with the maximum number of batches, the reader waits for batches from the queue to be consumed before continuing to produce and queue further batches. By default, this is set to the `numberOfThreads` property value.
`multipleThreadsEnabled`	Boolean	false	Defines whether parallelization is enabled for the specific load item. By setting this parameter to `false` for a specific LoadItem, you override the set parallelization parameters and force the data load utility into single-threaded operation. This parameter is set manually for each LoadItem. If it is not specified, its default value of `false` is assumed.
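The defaults and limits in the table interact as follows, shown as a small Python sketch rather than the utility's own logic. The function name and the returned dictionary are hypothetical. Treating values below `1` as `1`, and deriving the default `queueSize` from the thread count after the limit is applied, are assumptions that the table does not state.

```python
MAX_THREADS = 8  # documented upper limit for numberOfThreads


def effective_settings(number_of_threads=1, input_data_list_size=20,
                       queue_size=None, multiple_threads_enabled=False):
    """Resolve the values a LoadItem would run with under the documented rules."""
    # Values above the maximum fall back to the maximum.
    # Assumption: values below 1 are treated as 1.
    threads = min(max(number_of_threads, 1), MAX_THREADS)
    if not multiple_threads_enabled:
        threads = 1  # the LoadItem is forced into single-threaded operation
    return {
        "numberOfThreads": threads,
        "inputDataListSize": input_data_list_size,
        # Assumption: the default queueSize follows the effective thread count.
        "queueSize": queue_size if queue_size is not None else threads,
    }


tuned = effective_settings(number_of_threads=12, multiple_threads_enabled=True)
legacy = effective_settings(number_of_threads=4)
```

Here `tuned` resolves to eight threads because `12` exceeds the maximum, while `legacy` stays single threaded because `multipleThreadsEnabled` was left at its default of `false`.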