General data load best practices

The following best practices are recommended when you use the Data Load utility to load data.

To ensure better performance, problem determination, and maintenance, consider the following best practices:

Configuration for the initial loads
Configuration for the delta loads
Running the data load script file
File difference preprocessing
Data Load utility configuration files
Configuring the data load order file (wc-dataload.xml)
Configuring the data load environment configuration file (wc-dataload-env.xml)
Configuring the data load business object configuration file
CSV input files
Loading by unique ID
Reversing a data load
Tuning the Data Load utility

Configuration for the initial loads

For more information about recommended configurations during initial loads, see Scenario: Initial load.

Configuration for the delta loads

For more information about recommended configuration during delta loads, see Scenario: Delta load.

Running the data load script file

If you run the Data Load utility from the Utilities container, consider mounting a directory of the host system in a directory within the container. Place all your Data Load files in a subdirectory of the mounted directory. This approach ensures that the files are not lost when the container is overwritten by another Docker image. It can also be easier to edit files directly in the host system using your favorite editor.

When running the utility , consider the following command-line options:

-DXmlValidation=false

Turn off XML validation if you are using some variable substitution for integer attributes.

-D.level=FINEST

Turn on more tracing in the utilities_root\logs\wc-dataload.log file.

For large loads, specifying FINEST trace level, causes too much tracing in the log file. You can turn on tracing for one package as shown in the following examples.

-Dcom.ibm.commerce.catalog.dataload.level=FINER: Finer log level for catalog information.
-Dcom.ibm.commerce.foundation.dataload.database.level=FINE: Log SQL issues.
-Dcom.ibm.commerce.foundation.dataload.level=CONFIG: Log Data Load performance issues.
-Dcom.ibm.commerce.price.dataload.level=FINER: Log price-related load issues
-Dcom.ibm.commerce.inventory.dataload.level=FINER: Log inventory-related load issues.
-Dcom.ibm.commerce.member.dataload.level=FINER: Log member-related load issues.
-Dcom.ibm.commerce.marketing.dataload.level=FINER: Log marketing-related load issues.
-Dcom.ibm.commerce.promotion.dataload.level=FINER: Log promotion-related load issues.
-Dcom.ibm.commerce.pagelayout.dataload.level=FINER: Log Commerce Composer-related load issues.

Customize the Java logging configuration file, utilities_root\ts.ear\XML\config\dataload\logging.properties. For example, you can change the log file path, the maximum log file size and how many log files to cycle through. By default, you have one log file, and the log file is overwritten every time you run the Data Load utility.

File difference preprocessing

You can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. By using this preprocessor that you can compare two input files, such as a previously loaded file and a new version of this file. The preprocessor generates a difference file that contains only the records in the new file that are not within the old file or that are changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previous loaded records, then running this file difference can result in shorter load times. Running a file difference can reduce the loading time that is required to load your routine updates to your HCL Commerce database, reduce server usage time, and improve server performance.

You can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. You can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.

For more information about this preprocessor, see Data Load file difference preprocessing.

Data Load utility configuration files

There are three types of data load configuration files:

Load order configuration file (wc-dataload.XML)

You can either have multiple load order configuration files or one load order configuration file to include all the load items. If you want to load only a few load items, use the following command line when you are running the Data Load utility:

-DLoadOrder="loadItemName1, loadItemName2, loadItemName3"

Environment configuration file (wc-dataload-env.XML)

You need only one copy of this configuration file.

Business object configuration files

Typically one business object configuration file corresponds to one type of input data that loads one type of business object. The file name convention is wc-loader-business object.XML. This file defines the DataReader, BusinessObjectBuilder, and BusinessObjectMediator used for the data load. It also defines all the customization configuration options.

Keep all your data load configuration files relative to the wc-dataload.XML file. Ensure that the configuration files that are specified in the wc-dataload.XML file use the relative path. This path can make it easy to move the configuration files from one workstation to another.

Configuring the data load order file (wc-dataload.xml)

Consider the following configurations:

Specify commitCount, batchSize, dataLoaderMode at the LoadOrder level, so you do not need to specify them at each LoadItem level.
Specify the commitCount to be greater than or equal to the batchSize. The commitCount is a multiple of the batchSize.
To minimize the impact to your production environment, specify the commitCount and batchSize to 1. Specifying a large commitCount and batchSize improves the data load performance. However, this large commitCount and batchSize might affect the database and result in more database tables and rows that are being locked for a longer time.
For easy debugging of some SQL errors, specify the batchSize to 1 and turn on the database tracing. These settings can help you determine which SQL statement or input line caused the SQL error. If the batchSize is greater than 1, the JDBC batch update is enabled. With this batch update enabled, relating the SQL error to the input line or SQL statement that caused the error can be difficult.
Running the Data Load utility in insert mode can improve the performance of loading initial loads of large amounts of data. However, when you run the Data Load utility in insert mode the utility does not check your database before the utility loads your input data. The utility does not determine whether the data objects that you are creating exist in your database before the utility attempts to load and create the data objects. This behavior can cause the load process to fail when the utility attempts to create data objects that exist in your database.
For example, if your input file is a CSV file that contains a line that creates a product and a product description. If the file also contains a second line that loads the same product with a description in a different language, the load process fails. Since the utility creates the product and description in the first line, when the utility encounters the second line, the utility attempts to create the product again. Since the product exists, the load fails. To load this data, you can use one of the following methods:
- Run the Data Load utility in replace mode. In replace mode the Data Load utility generates insert, update, or delete SQL statements, depending on the data that you are loading. This mode replaces existing data that is contained in your database with your input data.
- Run the Data Load utility in insert mode, but load your product and description information separately. Use one input file to load your product information, and a different input file to load all of your descriptions.

Configuring the data load environment configuration file (wc-dataload-env.xml)

Consider the following configurations:

Do not specify the <_config:FilePath /> element if all your configuration files are relative to the wc-dataload.XML file.
Encrypt your database password in your configuration file. Run the wcs-encrypt utility as follows.
1. Open a command line shell in the Utilities container and go to the utilities_root/bin directory. For more information, see Running utilities from the Utility server Docker container.
2. Open a command line shell and go to the WCDE_installdir\bin directory.
3. Run the wcs-encrypt utility to learn the encrypted ASCII string.
  - ./wcs-encrypt.sh
  - wcs-encrypt
You can run the wcs-encrypt utility in the utilities_root\bin directory to find out the encrypted ASCII string: wcs-encrypt.bat/sh <plain password>
You can also leave the setting for the database password empty. When the password is empty, you are prompted to enter the password when you run the Data Load utility. You must enter the plain unencrypted password; your password is not echoed on the console for security purposes.
The IDResolver configuration is optional. You can specify the ID resolver cache size. For initial load, specify a large cache size, such as 1 million. For delta load, specify a cache size of 0. If you do not specify the ID resolver configuration, the default cache size is 0.
Instead of hardcoding the data writer in each of the data load business object configuration file, specify the default data writer class in this configuration file.
Set the cache size for the ID resolver to be 0 when you run frequent loads that can include the same information in the repeated loads. For instance, when you are running scheduled updates of your catalog data that overrides existing data. For more information about configuring this cache size and this file, see Configuring the data load environment settings.
If an environment configuration file is specified in the command line, it takes precedence over the element that exists in the data load environment configuration file.

Configuring the data load business object configuration file

Consider the following configurations:

If you are loading files in CSV format, instead of hardcoding the CSV column names in the <_config:Data> node of the <_config:DataReader> element, you can place all the column names in the first line of the CSV file. Then, you must specify firstLineIsHeader="true" in the <_config:DataReader> element.
Note: This configuration applies only when you use the CSVReader interface as your data reader.
When you define the mapping or table configuration in the <_config:BusinessObjectBuilder> element, if the value is from the input data, you do not need to specify the valueFrom attribute. The default is valueFrom="InputData".
If you do not want the Data Load utility to update all columns for a table when you are loading data, you can configure a column exclusion list. Configuring a column exclusion list causes the Data Load utility to avoid loading data into every column of a table. This exclusion configuration provides you the ability to exclude columns from being overwritten if the columns are known to already be populated with data.
For more information, see Configuring a column exclusion list.
When you configure an exclusion list, you can configure the list to with an optional parameter, forUpdateOnly="true". If you configure an exclusion list with this parameter set to true, when you update a table record with an input file that includes changes for an excluded column, the utility does not update the value for the excluded column with the input file value. If the same load operation includes data to create records in the table, the utility ignores the exclusion list and inserts the input file values into all columns for the new records.

If you are loading CSV files that contain leading whitespace, you can avoid errors by configuring the data load CSV reader to trim the token. To do so, set the trimTokenWhiteSpace property to true. For example:

<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader"
                    firstLineIsHeader="false" useHeaderAsColumnName="false">
  <_config:property name="trimTokenWhiteSpace" value="true" />
</_config:DataReader>

CSV input files

Consider the following tips when you are editing or maintaining your CSV files:

For clarity and readability, use the first line of the CSV file as a header.
```
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" >
```
You can use a spreadsheet software to open the CSV file to see whether your data matches the column heading.

For flexibility of using and omitting optional columns or rearranging the columns, use the header as column name instead of hard coding the columns.

<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" useHeaderAsColumnName="true" >

instead of:

<_config:Data>
	<_config:column number="1" name="Identifier" />
	<_config:column number="2" name="Name" />
 	<_config:column number="3" name="ShortDescription" />
 	<_config:column number="4" name="LongDescription" />
 	<_config:column number="5" name="Thumbnail" />
 	<_config:column number="6" name="FullImage" />
 	<_config:column number="7" name="Delete" />
</_config:Data>

If you want to use spreadsheet software to edit and save the CSV file, ensure that you use an editor that can save in UTF-8 format. For example, Open Office Calc. Otherwise, the spreadsheet software might reformat your CSV file and invalidate your data for the CSVReader. For example, if you have a time stamp column in the CSV file, not saving in UTF-8 format can reformat the time stamp data according to your locale setting. The reformatted time stamp data might not be valid for the use with your data load.
By default, data load supports only the Java standard time stamp format "yyyy-MM-dd hh:mm:ss.nnnnnn". The Data Load utility can also support a customized time stamp format. If you have a customized time stamp format in your CSV file, you must specify the format in the load item configuration in the wc-dataload.XML file. For example, you can specify the following row in the <_config:LoadItem> element:
```
<_config:property name="timestampPattern" value="yyyy-MM-dd HH.mm.ss" />
```
CSV formatted data is highly flexible and prone to errors. Manually editing comma delineated data can lead to unintentional segmentation, or formatting errors which can cause your data to become not loadable or incorrect.

Loading by unique ID

Specifying the unique ID is optional when you are using the Data Load utility. However, if you specify the unique ID, you save the processing time that is required to resolve the ID, and performance is improved.

Reversing a data load

If you want to reverse a load, you can run the same data load again with dataLoadMode="Delete" specified in the <_config:LoadOrder> element in your wc-dataload.XML data load configuration file. If you are reversing a load, specify the following configuration row within the <_config:LoadOrder> element:

<_config:property name="actionOnError" value="1" />

This configuration row ensures that the Data Load utility continues the process upon a soft delete error. This continuation is because dependent child records no longer exist because of cascade delete.

Tuning the Data Load utility

To reduce any performance impact from running the Data Load utility, you can adjust the Idresolver cache size and parameters that are related to the utility. For more information, see Data Load utility performance tuning.