Data Load file difference preprocessing

You can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. Running a file difference can reduce the time that is required to load your routine updates to your WebSphere Commerce database, reduce server usage time, and improve server performance.

The file difference preprocessor is available only for the Data Load utility. By using this preprocessor, you can compare two input files, such as a previously loaded file and the newest version of this file. The preprocessor generates a difference file that contains only the records in the new file that are not in the old file or that are changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previously loaded records, running this file difference can result in shorter load times. This preprocess can be scaled to compare files with millions of records.
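The comparison that is described above can be illustrated with a minimal Python sketch. This sketch is not the preprocessor implementation. The function name `diff_lines`, the sample column names, and the assumption that the key is a single column are hypothetical and are used only for illustration.

```python
import csv


def key_of(line, key_index):
    """Extract the key token from one raw CSV record (hypothetical helper)."""
    return next(csv.reader([line]))[key_index]


def diff_lines(old_lines, new_lines, key_index=0):
    """Return the header plus the records in new_lines that are new or changed.

    Records are compared as full strings, so any textual change counts.
    """
    header = new_lines[0]
    if old_lines[0] != header:
        # A changed header causes the entire new file to be treated as changed.
        return list(new_lines)
    # Index the old records by key so that record order does not matter.
    old_by_key = {key_of(line, key_index): line for line in old_lines[1:]}
    changed = [
        line
        for line in new_lines[1:]
        if old_by_key.get(key_of(line, key_index)) != line
    ]
    return [header] + changed


old_file = ["PartNumber,Name,Price", "A1,Shirt,10", "A2,Hat,5"]
new_file = ["PartNumber,Name,Price", "A1,Shirt,12", "A2,Hat,5", "A3,Sock,2"]
difference_file = diff_lines(old_file, new_file)
```

In this example, the unchanged record for A2 is omitted, and only the changed record (A1) and the new record (A3) remain to be loaded.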

The file difference preprocessor is not a general-purpose file difference tool. If the contents of the old file that you are comparing exist in your WebSphere Commerce database, loading the generated difference file into your database is the equivalent of loading the entire new file. If the generated difference file is smaller than your new file, loading the difference file can reduce the overall loading time that is required to update your database to match the contents of your new file.
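The equivalence can be checked with a small Python sketch that models the database as a dictionary that is keyed by record identifier. The names `load_replace` and `difference` are hypothetical, and the sketch assumes that a replace-mode load inserts new keys and overwrites existing keys.

```python
def load_replace(database, records):
    """Model a replace-mode load: insert new keys and overwrite existing keys."""
    updated = dict(database)
    updated.update(records)
    return updated


def difference(old, new):
    """Keep only the records in new that are missing from old or that changed."""
    return {key: value for key, value in new.items() if old.get(key) != value}


old_file = {"A1": ("Shirt", "10"), "A2": ("Hat", "5")}
new_file = {"A1": ("Shirt", "12"), "A2": ("Hat", "5"), "A3": ("Sock", "2")}
diff_file = difference(old_file, new_file)

# The old file is already loaded: both paths produce the same database.
database = load_replace({}, old_file)
assert load_replace(database, diff_file) == load_replace(database, new_file)

# The old file was never loaded: the difference file alone is not enough.
assert load_replace({}, diff_file) != load_replace({}, new_file)
```

The second assertion shows why the old file must exist in the database: the unchanged record A2 appears only in the old file, not in the difference file.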

You can also run the file difference preprocessor as a separate process from the actual loading of data into your database. You can use this preprocess to generate a difference file but not load the file. Because the process stops before the file loads, the preprocessor does not affect your database or your WebSphere Commerce system performance. You can then load the difference file with the Data Load utility later, when the loading process has the least impact on your database and WebSphere Commerce system performance.

The file difference is implemented as a data reader preprocessor. It runs at the beginning of the data reader initialization when you run the Data Load utility. By default, two data reader preprocessors are provided for running a file difference: one for comparing CSV files (CSVFileDiffPreprocessor) and one for comparing XML files (XmlFileDiffPreprocessor).

The data reader preprocessor is specified as a DataReaderPreprocessor subelement within the DataReader element of the Data Load business object configuration file. For example:

<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" useHeaderAsColumnName="true" >
   <_config:DataReaderPreprocessor className="com.ibm.commerce.foundation.dataload.datareader.CSVFileDiffPreprocessor" />
</_config:DataReader>

You do not need to explicitly specify this preprocessor to run a file difference. To run the file difference, you must specify only the key column property values to uniquely identify records in your input files. You must also specify the file location for the older file you want to compare. If you include these two required file difference properties in your configuration files, the file difference preprocessor automatically runs when you run the Data Load utility. For more information about configuring the Data Load utility to run a file difference, see Configuring the Data Load utility to run a file difference preprocess.

Best Practices

When you run the file difference preprocessor, ensure that you consider the following tips and recommendations:

Use the file difference only when your data is updated and managed in your backend system and routinely loaded into your production database.
Ensure that your old input file is successfully loaded. Make sure that any errors encountered in this file are fixed in the file and in your source database. If you fix the errors in the old file, but not in your source database, these errors might exist in your new input file. These errors might also be included in the generated difference file.
Ensure that your Data Load utility is configured to run in replace mode.
If your input files are CSV files, ensure that you include a header with the appropriate columns. Changes to this header between files can cause the entire new file to be identified as changed and included in the difference file.
Ensure that you do not rearrange your columns or XML elements. If this order is changed, all records are considered changed and are included in the difference file, because the file difference uses a string comparison to compare the full record entry in each file. You do not need to sort your records; the order in which records appear can differ between files. The preprocessor can identify and omit duplicate records.
The file difference preprocessor can perform a column-based comparison so the preprocessor does not need to compare each record by the full record string. With the column-based comparison, the arrangement of the columns or XML elements in the files can be ignored. You can configure the comparison to include or exclude specific columns. Use column-based comparison when input files include columns or XML elements that are arranged differently between various input files.
If your files include columns or XML elements that include values that are always different and do not affect whether a record is truly changed, exclude the column from comparison. To exclude this type of column, configure the preprocessor to use column-based comparison and configure a column exclusion list for the preprocess.
If your files include only a few columns or elements that determine whether a record is truly changed and must be updated, you can compare only these columns or elements. To include only these columns, configure the preprocessor to use column-based comparison and configure a column inclusion list for the preprocess.
Ensure that your configuration files identify the correct files for comparison. For example, if you routinely run catalog and inventory loads, ensure that you do not compare an inventory file with a catalog file. If you do, all records in the new file are included in the generated difference file and no loading time is saved by running the preprocessor.
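The column-based comparison with inclusion and exclusion lists that is described in these recommendations can be sketched in Python as follows. This sketch only illustrates the idea and does not show how the preprocessor is configured. The function names, the `include` and `exclude` parameters, and the sample columns (such as `LastExport`) are hypothetical.

```python
import csv
import io


def read_by_key(text, key_column):
    """Read CSV text into a mapping from key value to a column-name dictionary."""
    return {row[key_column]: row for row in csv.DictReader(io.StringIO(text))}


def column_diff(old_text, new_text, key_column, include=None, exclude=()):
    """Return the keys of records in new_text that are new or changed.

    Values are compared by column name, so column order is ignored.
    include limits the comparison to the named columns; exclude skips columns.
    """
    old = read_by_key(old_text, key_column)
    new = read_by_key(new_text, key_column)
    changed = []
    for key, row in new.items():
        columns = [
            name
            for name in row
            if (include is None or name in include) and name not in exclude
        ]
        previous = old.get(key)
        if previous is None or any(previous.get(c) != row[c] for c in columns):
            changed.append(key)
    return changed


old_text = "PartNumber,Price,LastExport\nA1,10,2020-01-01\nA2,5,2020-01-01\n"
# The new file reorders the columns and always carries a new export date.
new_text = "Price,PartNumber,LastExport\n10,A1,2020-02-01\n6,A2,2020-02-01\n"

all_columns = column_diff(old_text, new_text, "PartNumber")
without_date = column_diff(old_text, new_text, "PartNumber", exclude={"LastExport"})
price_only = column_diff(old_text, new_text, "PartNumber", include={"Price"})
```

When all columns are compared, the export date that always differs marks every record as changed. With the date excluded, or with only the price column included, only the record whose price changed (A2) is reported.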

Limitations

When you run the file difference preprocessor to help improve the data load performance, you must understand the file difference behavior and limitations:

You must run the file difference preprocess with the Data Load utility in replace mode. This preprocess generates an error if you are running the Data Load utility in insert or delete mode.
The file difference can compare only two CSV files or two XML files. If you specify a CSV file and an XML file to be compared, errors occur.
The generated difference file can contain records that exist in your database. The preprocessor compares only the two input files. There is no comparison against your database to omit records from the difference file that exist in your database.
The generated difference file can contain records that are loaded with the old input file when the preprocessor encounters minor changes. For example, if the column order for your records is changed between files, the difference file includes these records even if the data for these columns is not changed.
You can configure the preprocessor to use a column-based comparison, which can ignore certain minor changes. With this comparison, the preprocessor can ignore the following minor differences between records and files:
- A column value includes no quotation marks in one file and includes double quotation marks in the other file, such as for CSV tokens.
- A record includes one or more extra commas at the end of the record in one file.
- The files include the same columns, but in a different order.
The file difference preprocessor compares all of the data in your input files, even the data in columns that are excluded from being loaded. If the preprocessor encounters differences because of these columns, records can be included in the difference file that are duplicates of the actual data that is loaded with the old input file.
You can configure the preprocessor to ignore specific columns, or to compare only specific columns. If you configure the preprocessor to ignore specific columns or to compare only specific columns, the preprocessor uses column-based comparison automatically.
Running a file difference for XML files can require more processing time than a comparison of CSV files. The processing time is longer because XML files are typically much larger than CSV files that contain the same amount of data.
The preprocessor behaves as though the old input file is successfully loaded into your database before it runs the file difference. If the old file did not load successfully, you must fix any errors that are encountered during that load process and ensure that the old file is loaded. If the old file is not loaded, the generated difference file might not load successfully, because records in the difference file can depend on records in the old file that are assumed to be loaded already.
If a user changes the data in your WebSphere Commerce database after the old input file is loaded, then loading the generated difference file might not be equivalent to loading the entire new file. If you are using the file difference preprocess, it is recommended that you do not update the same data with other WebSphere Commerce tools, such as Management Center.
If a workspace approver can change the data after you load the old file, avoid running the file difference on data that you load into the workspace.
If you configure the preprocessor to use column-based comparison, the CSV file difference preprocess can take longer to complete.
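The minor differences that a column-based comparison can ignore (quotation marks, extra trailing commas, and column order) can be illustrated with a short Python sketch that contrasts a full-string comparison with a token-based comparison. The helper names `parse` and `same_record` are hypothetical. Dropping empty trailing tokens is a simplification that is assumed here and is not a description of the preprocessor internals.

```python
import csv


def parse(line):
    """Parse one CSV record into tokens and drop empty trailing tokens."""
    tokens = next(csv.reader([line]))
    while tokens and tokens[-1] == "":
        tokens.pop()
    return tokens


def same_record(old_header, old_line, new_header, new_line):
    """Compare two records by column name and parsed value, not by raw string."""
    old = dict(zip(parse(old_header), parse(old_line)))
    new = dict(zip(parse(new_header), parse(new_line)))
    return old == new


header = "PartNumber,Name"

# A full-string comparison treats each of these pairs as different records.
assert 'A1,"Shirt"' != "A1,Shirt"

# A token-based comparison ignores quotation marks, trailing commas, and order.
quotes = same_record(header, 'A1,"Shirt"', header, "A1,Shirt")
trailing = same_record(header, "A1,Shirt", header, "A1,Shirt,,")
reordered = same_record(header, "A1,Shirt", "Name,PartNumber", "Shirt,A1")

# A real change in a value is still detected.
real_change = same_record(header, "A1,Shirt", header, "A1,Hat")
```

Parsing every record into tokens is more work than comparing raw strings, which is consistent with the longer processing time that is noted for column-based comparison.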