
Configuring the Data Load utility to run a file difference preprocess
If you routinely load the same generated Data Load input file from an external system or source, you can choose to run a file difference preprocess as part of the Data Load process to ensure that you are loading only new changes when you load your newest input file.
Before you begin
- Identify the two Data Load input files that you want to compare and generate a difference file from.
- Successfully load the old input file into your WebSphere Commerce database. Records that are identical in the old file and in the new file are left out of the generated difference file. If the old file was never loaded, those records are therefore never loaded into your database. To prevent records from being omitted without ever being loaded, verify that the contents of the old file are loaded into your database.
About this task
A file difference tool is available as a data reader preprocessor when you run the Data Load utility. This file difference preprocessor can read and compare two CSV files or two XML files. The preprocessor uses a different class for each format (CSVFileDiffPreprocessor for CSV files and XmlFileDiffPreprocessor for XML files), so you cannot compare a CSV file to an XML file.
Configuration properties for file difference preprocessor
The keyColumns property must be specified in the business object configuration file.
Configuration property | Description |
---|---|
keyColumns | Mandatory. Key columns are the CSV columns or XML elements that uniquely identify a record in your input file. |
numberOfSplitFiles | Optional. Use this property to specify how many files the input files are split into when the old input file is too large to be stored in memory. It is recommended that the |
checkDuplicatedKeys | Optional. Specify this property as true to perform an extra check for duplicate entries. It is recommended that you specify |
diffFileDirectory | Optional. Use this property to change the directory where the generated difference file is saved. |
dataReaderPreprocessOnly | Optional. Specifying this property as true stops the Data Load process after the difference file is generated and saved. |
cleanupSplitFiles | Optional. If your input files are split, you can set this property to false to keep the temporary smaller files that are generated. If this property is set to true or omitted, the smaller files are deleted after the files are merged. |
columnBasedCompare | Optional. Indicates whether the preprocessor is to use a column-based comparison to compare files. You can set the following values for this property: Note: A column-based comparison can take longer to complete than the default file difference preprocessor behavior, because the preprocessor must complete an extra look-up between the files. |
includeCompareColumns | Optional. Indicates whether the file difference preprocessor is to compare only specific columns. Use a comma-separated list as the value for this property to identify the columns to be compared. Any column that is not in this list is ignored during the file comparison. When you include this property and do not explicitly configure the columnBasedCompare property, columnBasedCompare defaults to true. If you include both the If you include the Note: If you include the includeCompareColumns property without a value and the excludeCompareColumns property is not set with a value, the file difference preprocessor compares only the key columns. The generated difference file then includes only the records from the new input file that have a key column value that is not in the old input file. |
excludeCompareColumns | Optional. Indicates whether the file difference preprocessor is to exclude specific columns from being compared. Use a comma-separated list as the value for this property to identify the columns to be excluded from comparison. All other columns are compared. When you include this property and do not explicitly configure the columnBasedCompare property, columnBasedCompare defaults to true. If you include both the If you include the |
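The column selection that the includeCompareColumns and excludeCompareColumns descriptions imply can be sketched in Java as follows. This is an illustration only, not the preprocessor's code: the class name, method names, and sample column names are hypothetical, and because the combined behavior of the two properties is not described here, the sketch rejects that case rather than guessing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical illustration; not part of the Data Load utility. */
final class CompareColumnsSketch {

    /**
     * Chooses the columns whose values are compared after two records
     * match on their key columns. Pass null for a property that is not
     * configured. An empty include list selects no columns, so only the
     * key columns decide whether a record is new.
     */
    static List<String> columnsToCompare(List<String> header,
                                         List<String> include,
                                         List<String> exclude) {
        if (include != null && exclude != null) {
            // Assumption: the combined behavior is out of scope for this sketch.
            throw new IllegalArgumentException("sketch handles one list at a time");
        }
        List<String> selected = new ArrayList<>();
        for (String column : header) {
            boolean keep = (include != null)
                    ? include.contains(column)
                    : (exclude == null || !exclude.contains(column));
            if (keep) {
                selected.add(column);
            }
        }
        return selected;
    }

    /** True when the records differ in at least one selected column. */
    static boolean differs(Map<String, String> oldRecord,
                           Map<String, String> newRecord,
                           List<String> columns) {
        for (String column : columns) {
            if (!Objects.equals(oldRecord.get(column), newRecord.get(column))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> header = List.of("PartNumber", "Name", "Price", "Description");
        System.out.println(columnsToCompare(header, List.of("Price"), null));
        System.out.println(columnsToCompare(header, null, List.of("Description")));
    }
}
```

With an include list of only Price, a record whose Name changed but whose Price did not is treated as unchanged and is left out of the difference file.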
Configuring the file difference preprocessor to handle large input files
The file difference preprocessor loads the old input file into a hash map in your system memory and compares this hash map to the new input file to generate a difference file. If the old file is too large to be loaded into your system memory, the file difference preprocessor splits the file into smaller files. The new input file is also split into the same number of smaller files. The preprocessor generates a difference file for each pairing of these smaller files and then merges these files into a single larger difference file.
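The hash map comparison can be pictured with the following minimal Java sketch. It is not the CSVFileDiffPreprocessor implementation. The class and method names are hypothetical, the comma split is naive (no quoted fields), every line is treated as a record (no header row), and records that exist only in the old file are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not part of the Data Load utility. */
final class FileDiffSketch {

    /** Builds a lookup key from the values in the key columns. */
    static String keyOf(String[] fields, int[] keyColumns) {
        StringBuilder key = new StringBuilder();
        for (int column : keyColumns) {
            key.append(fields[column]).append('\u0001'); // separator unlikely to occur in data
        }
        return key.toString();
    }

    /** Returns the records in newLines that are missing from, or differ from, oldLines. */
    static List<String> diff(List<String> oldLines, List<String> newLines, int[] keyColumns) {
        // Load the old file into memory: key column values -> full record.
        Map<String, String> oldByKey = new HashMap<>();
        for (String line : oldLines) {
            oldByKey.put(keyOf(line.split(",", -1), keyColumns), line);
        }
        // Keep each new record that has no identical counterpart in the old file.
        List<String> changed = new ArrayList<>();
        for (String line : newLines) {
            String previous = oldByKey.get(keyOf(line.split(",", -1), keyColumns));
            if (!line.equals(previous)) {
                changed.add(line);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<String> oldFile = List.of("SKU-1,Shirt,10.00", "SKU-2,Hat,5.00");
        List<String> newFile = List.of("SKU-1,Shirt,10.00", "SKU-2,Hat,6.00", "SKU-3,Scarf,8.00");
        // SKU-1 is unchanged and is omitted; SKU-2 changed and SKU-3 is new.
        System.out.println(diff(oldFile, newFile, new int[] {0}));
    }
}
```

The whole old file sits in the map while the new file is streamed past it, which is why the size of the old file, not the new one, determines the memory that is needed.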
By default, the file difference preprocessor automatically determines the number of files that are required to split a large file. You can instead use the numberOfSplitFiles property to configure the number of files that your large input files are split into. If you do configure this property, ensure that you specify a large enough number of files so that the records in each smaller file can be stored in memory.
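The split, compare, and merge sequence can be sketched in Java as follows. How the preprocessor actually assigns records to the smaller files is not documented here. This sketch assumes that records are assigned by a hash of the key so that matching records always land in the same pair. It uses in-memory lists in place of temporary files, treats the first column as the only key column, and uses hypothetical names throughout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not part of the Data Load utility. */
final class SplitDiffSketch {

    /** Assumption: the first CSV column is the only key column. */
    static String keyOf(String line) {
        return line.split(",", -1)[0];
    }

    /** Distributes records over n parts by key hash, so equal keys share a part. */
    static List<List<String>> split(List<String> lines, int n) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        for (String line : lines) {
            parts.get(Math.floorMod(keyOf(line).hashCode(), n)).add(line);
        }
        return parts;
    }

    /** Compares one old part with its paired new part. */
    static List<String> diffPair(List<String> oldPart, List<String> newPart) {
        Map<String, String> oldByKey = new HashMap<>();
        for (String line : oldPart) {
            oldByKey.put(keyOf(line), line);
        }
        List<String> changed = new ArrayList<>();
        for (String line : newPart) {
            if (!line.equals(oldByKey.get(keyOf(line)))) {
                changed.add(line);
            }
        }
        return changed;
    }

    /** Splits both inputs the same way, compares pair by pair, and merges the results. */
    static List<String> splitDiff(List<String> oldLines, List<String> newLines, int numberOfSplitFiles) {
        List<List<String>> oldParts = split(oldLines, numberOfSplitFiles);
        List<List<String>> newParts = split(newLines, numberOfSplitFiles);
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < numberOfSplitFiles; i++) {
            merged.addAll(diffPair(oldParts.get(i), newParts.get(i)));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> oldFile = List.of("A,1", "B,2", "C,3", "D,4");
        List<String> newFile = List.of("A,1", "B,9", "C,3", "D,4", "E,5");
        System.out.println(splitDiff(oldFile, newFile, 3));
    }
}
```

Only one pair of parts is held in memory at a time, so a larger number of parts lowers the peak memory use at the cost of extra splitting work. In this sketch the merged records follow part order, not the order of the new input file.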
Splitting the input files into smaller files requires processing time and disk space. If your system has sufficient physical memory and uses a 64-bit JVM, you can increase the JVM maximum heap size to handle large input files. If enough memory is allocated and the preprocessor does not need to split the input files, the difference file can be generated faster. For more information about tuning your JVM performance, including your JVM heap size, see JVM performance tuning.
Procedure
Example
<_config:DataLoadConfiguration
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:_config="http://www.ibm.com/xmlns/prod/commerce/foundation/config"
    xsi:schemaLocation="http://www.ibm.com/xmlns/prod/commerce/foundation/config ../xsd/wc-dataload.xsd">
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml"/>
  <_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Replace">
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <_config:property name="dataReaderPreprocessOnly" value="true"/>
      <_config:DataSourceLocation location="c:/temp/dataload/samples/CatalogEntryNew.csv" oldLocation="c:/temp/dataload/samples/CatalogEntryOld.csv"/>
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
In this example configuration file, the two input files are located in a temporary sample directory. After the preprocessor completes, the generated difference file, CatalogEntryNew_diff_2013.03.28_12.01.01.001.csv, is saved in the same temporary directory. This sample includes the dataReaderPreprocessOnly configuration property, which causes the Data Load utility to run only the file difference preprocessor. To run the preprocessor, the configuration file specifies that the Data Load utility is to run in Replace mode.