The Elasticsearch index lifecycle

Structure, advantages, and lifecycle of the HCL Commerce Version 9.1 Elasticsearch-based search system.

Structure of a multidimensional index

In the HCL Commerce Version 9.1 implementation of Elasticsearch, index schemas are provided for each product index. Each schema is further subdivided into several classifications. For example, the Product schema is broken down into Entitlement, Price, Inventory, Browsing, Attributes, and Properties. Each classification can be further indexed into multiple field types based on usage, such as Display, Searching, Filtering, Faceting, Boosting, or Sorting. To examine the schema for this example, see Ingest Product index pipeline.
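
For illustration only, the following sketch shows how a single logical attribute might fan out into several usage-specific index fields. The field names here are hypothetical; the actual names and structure are defined by the schema in Ingest Product index pipeline.

# Hypothetical example of one logical attribute expressed as several
# usage-specific index fields; the names are illustrative, not the real schema.
product_document = {
    "attributes.color.display": "Deep Red",   # value returned for display
    "attributes.color.search": "deep red",    # analyzed text for searching
    "attributes.color.facet": "Red",          # normalized keyword for faceting
    "attributes.color.sort": "red",           # keyword used for sorting
}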

The Ingest service creates a separate indexed document for each supported language, for each supported catalog, and for each store. This might be concerning if you expect the result to be a very large index, and size does matter when you first set up the index. However, Elasticsearch's ability to perform incremental updates changes the equation once the index is in place. If day-to-day changes are not large, performance is not greatly affected, as it would be if you had to perform regular full reindexing. In addition, the use of eSites also improves performance. Each eSite has its own index, because each store may have its own data lifecycle. Optionally, if the eSites share the same data lifecycle and catalog, you can use the master catalog approach with the stores.

Any indexed document can be used in multiple contexts, such as approved (unpublished) and live (published). These contexts can have work-in-progress, updated, or overwritten values, each expressed as a new index field name with the appropriate prefix. Note that new (or to-be-deleted) documents are tagged as new (or deleted) to avoid being included in the Live index.

Push-To-Live requests

When staging propagation starts, the publish operation on the Authoring Transaction server pushes, or replicates, all production-ready changes from the Auth database to the Live database.

This Push-To-Live (PTL) approach no longer requires replicating to Subordinate nodes. Instead, a copy of the new live index is created in the Live environment and is swapped in once the new index is ready. The old version is decommissioned immediately.

Data object caches and REST caches that are used in the Query Service of the Live environment are invalidated; JSP caching used in the Store server may also be invalidated, if applicable.

Note: When you are working in the Live environment, build the Live inventory index before running Push-To-Live. For more on this and for more general information about PTL, see Push to Live in Search.

Initial index setup

To perform an initial index setup, run a full reindexing operation of the search indices for the specific store. The indexing process starts with the creation of the schema, the product database, and STA. Data is loaded by drilling progressively down into the catalog and then into each category. The main flow loop then takes over, processing attributes, then products, then SEO URLs, and finally, after exiting the loop, price.
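
As a hedged example, a full reindex can be triggered from a script against the Ingest service's connector run endpoint. The hostname, port, and store ID below are placeholders, and the endpoint path is assumed to follow the standard Ingest connector convention:

import requests

INGEST_HOST = "http://ingest-hostname:30800"  # placeholder hostname and port

# Run the auth.reindex connector to start a full reindex for store 1.
# (Endpoint path assumed from the Ingest service connector conventions.)
resp = requests.post(f"{INGEST_HOST}/connectors/auth.reindex/run",
                     params={"storeId": "1"})
resp.raise_for_status()
print(resp.json())  # run identifier/status for the indexing flow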

In subsequent full reindexing operations, the new index is built in parallel with the existing one. Once the new index is ready, the index alias is simply updated to point to the new one, which then becomes the active index. There is also a regular cadence of updates, for example inventory, that can be scheduled as recurring jobs. These updates can be done through 'smart copying' of inventory and price into the index. The smart copy does not copy every item, since many will not have changed between runs. Copying everything would trigger the equivalent of a full invalidation; therefore, the smart copy moves only items that have changed into the index, and only those items trigger an invalidation.
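
Conceptually, the swap uses Elasticsearch's atomic aliases update. The following sketch shows the equivalent operation issued directly against Elasticsearch; the index and alias names are illustrative placeholders, and in practice the Ingest service performs this step for you:

import requests

ES_URL = "http://elasticsearch-hostname:9200"  # placeholder

# Repoint the alias from the previous build to the new build in one atomic call.
# Index and alias names below are illustrative placeholders.
actions = {
    "actions": [
        {"remove": {"index": "auth.1.product.20240101", "alias": "auth.1.product"}},
        {"add": {"index": "auth.1.product.20240201", "alias": "auth.1.product"}},
    ]
}
resp = requests.post(f"{ES_URL}/_aliases", json=actions)
resp.raise_for_status()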

During this process and afterward, Natural Language Processing (NLP) is performed and matchmaker relationships, such as color matching, are constructed. These elements do not rely on the index being finished, but supplement its contents when they are complete. NLP is expensive to run; however, there are features of the NLP dataset, such as lemmatization, that rarely change. The first NLP run takes longer because it performs full lemmatization and writes to the index; subsequent runs are faster because they do not have to repeat already-completed operations.

Near Real-Time (NRT) updates in the Authoring environment

All business data updates in the Management Center are first written to the database. This is followed by an incremental update event (along with the Authoring context) sent through Redis. Business logic built into the Apache NiFi indexing pipeline analyzes and processes these change requests.

Acting as the message bus, Redis broadcasts this change request to all NiFi connectors that monitor for this kind of operation. The result is incremental updates to the appropriate Search indices. The data object cache and REST caches used in the Query Service for the Authoring environment are invalidated; JSP caching used in the Store server may also be invalidated, if applicable.

This approach can provide a near-real-time update experience via the Query service, when previewing the Storefront in the Authoring environment.

After NRT updates, you can check the status of the search index for the specific store by using the following endpoint:
http://<HOSTNAME>:<port>/search/resources/api/v2/data/status?storeId=1
A Swagger interface to this endpoint, V2-Data-Status, is available in the Query REST API.
Note: The envType parameter for this endpoint is optional. By default, its value is auth.
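
For example, a small script that calls this status endpoint (the hostname and port are placeholders; envType is shown explicitly here even though it defaults to auth):

import requests

QUERY_HOST = "http://query-hostname:30901"  # placeholder hostname and port

# Check the search index status for store 1 in the Authoring environment.
resp = requests.get(
    f"{QUERY_HOST}/search/resources/api/v2/data/status",
    params={"storeId": "1", "envType": "auth"},
)
resp.raise_for_status()
print(resp.json())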

Data Load with Elasticsearch-based search

The Data Load utility is a tool that loads data from a source file into a target database. It can also delete data from a database. Dataload supports incremental updates of catalog data if the Elasticsearch search engine is enabled.

Configuration
Configure Dataload to trigger incremental index updates. For more information, see Data Load utility.
Index update process

After the Dataload process updates the Commerce database with business data, the corresponding change history is written to the two TI_DELTA database tables, as is the case with the corresponding Solr process. Once the load operation completes successfully, an event is sent through Redis to NiFi to launch an indexing flow.

The data object cache and REST caches used in the Query Service for the Authoring environment are invalidated. Any JSP caching used in the Store server may also be invalidated if applicable.

Note: Because Dataload runs as a non-interactive batch process and may contain a large number of data updates, the search index updates can take longer to complete, and changes are only visible after cache invalidation takes place.
Considerations when using NRT with Dataload:
  • In the Solr architecture, changing a product name necessitates a delta indexing process that rebuilds the full document for the given product, because Solr has no built-in logic to identify the specific change being applied. When you use NRT with Elasticsearch, individual fields, such as the product name, are updated directly. This can provide performance gains over Solr.
  • Offline data load is the preferred method as it uses direct JDBC connectivity and is faster when compared to catalog upload. Use offline data load to load large volumes of data.
  • Data updated through Dataload is processed in the same way as it is with Solr-based search. The utility makes use of the TI_DELTA_CATENTRY and TI_DELTA_CATGROUP database tables to keep track of which products or categories have been updated (see the sketch after this list).
  • Elasticsearch differs from Solr in that a "Complete" event is sent by Dataload through Redis to NiFi once the data load operation is completed. This event may take up to four minutes to arrive, because the scheduler job running on the Transaction server recurs only once per regular interval.
  • Once this "Complete" event reaches NiFi, the auth.dataload connector processes the products or categories identified in TI_DELTA_CATENTRY and TI_DELTA_CATGROUP.
  • When workspaces are enabled, data can be loaded directly into either the approved content schema or the workspace schema. The auth.dataload connector in NiFi can perform ingest against the appropriate database schema based on the given workspace context.
  • Whether or not you use workspaces, once Dataload has successfully loaded data into the Authoring environment, Push-To-Live can be used to move the indexed changes over to the Live environment. Refer to the Push-To-Live steps for details on how to perform this task. Run StagingProp before pushing the indexed data from Authoring to Live.
  • When Push-To-Live is performed, the authoring index is cloned to the Live environment (with the exception of all workspace-related documents and metadata), and this is then followed by a series of relevant cache invalidation events. This fine-grained cache invalidation is based only on the products or categories that have been updated as a result of Staging Propagation and Push-To-Live.
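
To confirm that Dataload recorded change history for indexing, you can inspect the delta tables directly. The following is a minimal sketch that assumes a Db2 database and the ibm_db Python driver; the connection string is a placeholder for your environment:

import ibm_db

# Placeholder connection string; substitute your Commerce database details.
conn = ibm_db.connect(
    "DATABASE=mall;HOSTNAME=db-hostname;PORT=50000;PROTOCOL=TCPIP;"
    "UID=dbuser;PWD=password;",
    "", "")

# Count the catalog entries and categories queued for incremental indexing.
for table in ("TI_DELTA_CATENTRY", "TI_DELTA_CATGROUP"):
    stmt = ibm_db.exec_immediate(conn, f"SELECT COUNT(*) AS CNT FROM {table}")
    row = ibm_db.fetch_assoc(stmt)
    if row:
        print(table, row["CNT"])

ibm_db.close(conn)
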
HCL Commerce Version 9.1.13.0 or later

Index backups

When you build an index, backup copies of the previous two builds are kept by default. This setting does not affect performance, although in certain circumstances you may notice apparent inconsistencies in your logs. For example, if you run several Store 1, Store 11, or eSite index rebuilds in quick succession, you might expect to see only one index present with the latest timestamp. Because backups are kept, however, you may observe two indexes with the same timestamp. This merely indicates that backups are being created and that the most recent backup has been given the same timestamp as the current live index.

You can control index backup behavior using the alias.keep.backup configuration setting. First, you can verify the current setting by issuing a GET request to the following REST endpoint:
https://Data-Query:Port/search/resources/api/v2/configuration?nodeName=ingest&envType=auth
Search for the alias.keep.backup parameter. Its default setting is 2.
To change the setting, issue a PATCH request to the same endpoint. For example, to set the number of backups to zero (the default behavior prior to Version 9.1.13), issue the PATCH with the following body:
Body:
{
    "global": {
        "connector": [
            {
                "name": "attribute",
                "property": [
                    {
                        "name": "alias.keep.backup",
                        "value": "0"
                    }
                ]
            }
        ]
    }
}
For this example, the result is that no backup indexes are created. You can adjust the number of backups as appropriate to your environment.
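
The following sketch issues both calls from a script, using the endpoint and body shown above (the hostname, port, and TLS handling are placeholders for your environment):

import requests

CONFIG_URL = "https://data-query-hostname:30901/search/resources/api/v2/configuration"  # placeholder
PARAMS = {"nodeName": "ingest", "envType": "auth"}

# Read the current Ingest configuration and look for alias.keep.backup.
# verify=False is only a placeholder for environments with self-signed certificates.
current = requests.get(CONFIG_URL, params=PARAMS, verify=False)
current.raise_for_status()

# Set the number of retained backup indexes to zero (the pre-9.1.13 behavior).
body = {
    "global": {
        "connector": [
            {
                "name": "attribute",
                "property": [
                    {"name": "alias.keep.backup", "value": "0"}
                ]
            }
        ]
    }
}
updated = requests.patch(CONFIG_URL, params=PARAMS, json=body, verify=False)
updated.raise_for_status()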