The Elasticsearch index lifecycle

Structure, advantages, and lifecycle of the HCL Commerce Version 9.1 Elasticsearch system.

Structure of a multidimensional index

In the HCL Commerce Version 9.1 implementation of Elasticsearch, index schemas are provided for each index, such as the Product index. Each schema is further subdivided into several classifications. For example, the Product schema is broken down into Entitlement, Price, Inventory, Browsing, Attributes, and Properties. Each classification can be further indexed into multiple field types based on usage, such as Display, Searching, Filtering, Faceting, Boosting, or Sorting. To examine the schema for this example, see Ingest Product index pipeline.
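
The actual schema is defined by the Ingest Product index pipeline; as a rough, hypothetical Python sketch only, the structure below illustrates how fields might group into classifications, and how one logical field can be indexed under several usages (all field names here are invented for illustration):

    # Hypothetical illustration only; the real schema is defined by the
    # Ingest Product index pipeline.
    product_schema = {
        "Entitlement": {"contracts": ["Filtering"]},
        "Price": {"offerPrice": ["Display", "Filtering", "Sorting"]},
        "Inventory": {"quantity": ["Display", "Filtering"]},
        "Browsing": {"parentCategory": ["Filtering", "Faceting"]},
        "Attributes": {"color": ["Display", "Searching", "Faceting"]},
        "Properties": {"name": ["Display", "Searching", "Boosting", "Sorting"]},
    }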

The Ingest service creates a separate indexed document for each supported language, for each supported catalog, and for each store. This might be concerning if you expect the result to be a very large index, and size does matter initially, when you set up the index. However, Elasticsearch's ability to perform incremental updates changes the equation once the index is set up. If day-to-day changes are not large, performance is not greatly affected, as it would be if you had to perform regular full reindexing. In addition, the use of esites also improves performance. Each esite has its own index, because each store may have its own data lifecycle. Optionally, if the esites share the same data lifecycle and catalog, you can use the master catalog approach with those stores.
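
As a purely illustrative sketch of this fan-out, the loop below shows how a single catalog entry becomes one indexed document per store, catalog, and language combination (all IDs are hypothetical):

    import itertools

    # Hypothetical IDs; one document is produced per combination.
    stores, catalogs, languages = ["1"], ["10001", "10002"], ["-1", "-2"]
    for store, catalog, lang in itertools.product(stores, catalogs, languages):
        print(f"catentry 12345 -> document(store={store}, catalog={catalog}, language={lang})")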

Any indexed document can be used in multiple contexts, such as approved (unpublished) and live (published). These contexts can have work-in-progress, updated, or overwritten values, each expressed as a new index field name with an appropriate prefix. Note that new (or to-be-deleted) documents are tagged as new (or deleted) to avoid being included in the Live index.
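
As a hypothetical sketch (the actual prefixes are internal to the Ingest pipeline), the same document might carry context-specific values side by side:

    # Hypothetical field prefixes and values, for illustration only.
    document = {
        "name": "Classic sofa",          # approved (unpublished) value
        "wip.name": "Classic sofa v2",   # work-in-progress value under a prefix
        "live.name": "Classic sofa",     # live (published) value under a prefix
        "state": "new",                  # tag that keeps the document out of the Live index
    }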

Push-To-Live requests

When Staging Propagation starts, the publish operation on the Authoring Transaction server will push, or replicate, all production-ready changes from the Auth database to the Live database.

This Push-To-Live approach no longer requires replicating to subordinate nodes. Instead, a copy of the new live index is created in the Live environment and is swapped in once the new index is ready. The old version is decommissioned immediately.
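
The swap itself follows the standard Elasticsearch alias pattern: an alias is atomically repointed from the old index to the new one. A minimal sketch, with hypothetical host, index, and alias names:

    import requests

    ES = "http://elasticsearch:9200"  # hypothetical Elasticsearch endpoint

    # Atomically repoint the alias from the old live index to the new copy.
    actions = {
        "actions": [
            {"remove": {"index": "live.product.v1", "alias": "live.product"}},
            {"add": {"index": "live.product.v2", "alias": "live.product"}},
        ]
    }
    requests.post(f"{ES}/_aliases", json=actions).raise_for_status()

    # Decommission the old version once the swap has taken effect.
    requests.delete(f"{ES}/live.product.v1")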

Data object caches and REST caches used by the Query Service in the Live environment are invalidated; JSP caching used in the Store server may also be invalidated, if applicable.

For more information on Push-to-Live, see Push to Live in Search.

Initial index setup

To perform an initial index setup, run a full reindexing operation of the search indices for the specific store. The indexing process starts with the creation of the schema, the product database, and STA. Data is loaded by drilling progressively down into the catalog and then into categories. The main flow loop then takes over, processing attributes, then products, then SEO URLs, and finally, emerging from that loop, price.

In subsequent full reindexing operations, the new index is built in parallel with the existing one. Once the new index is ready, the index alias is simply updated to point at the new one, which then becomes the active index. There is also a regular cadence of updates, for example inventory updates, that can be scheduled as recurring jobs. These updates can be made through 'smart copying' of data such as inventory and price into the index. The smart copy does not copy every record, since many will not have changed between runs. Copying everything would trigger the equivalent of a full invalidation; therefore, the smart copy moves only items that have changed into the index, and only those items trigger an invalidation.
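
A minimal sketch of the smart-copy idea, assuming hypothetical index, document, and field names: fetch the currently indexed values, write only the fields that differ, and report whether an invalidation was actually triggered.

    import requests

    ES = "http://elasticsearch:9200"  # hypothetical Elasticsearch endpoint

    def smart_copy(index, doc_id, new_values):
        """Write only changed fields; return True if an update (and hence
        an invalidation) was actually triggered."""
        current = requests.get(f"{ES}/{index}/_doc/{doc_id}").json().get("_source", {})
        changed = {k: v for k, v in new_values.items() if current.get(k) != v}
        if changed:
            # Partial update touches only the changed fields.
            requests.post(f"{ES}/{index}/_update/{doc_id}", json={"doc": changed})
        return bool(changed)

    smart_copy("auth.product", "12345", {"inventory": 40, "offerPrice": 129.99})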

During this process and afterward, Natural Language Processing (NLP) runs and matchmaker relationships, such as color matching, are constructed. These elements do not rely on the index being finished, but supplement its contents when they are complete. NLP is expensive to run; however, there are features of the NLP dataset, such as lemmatization, that rarely change. The first NLP run takes longer because it performs full lemmatization and writes the results to the index; subsequent runs are faster, because they do not have to repeat already-completed operations.
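
The saving from not repeating completed work is essentially memoization. A generic sketch, not HCL Commerce code:

    # Generic memoization sketch; a real pipeline would call an NLP library
    # and persist results to the index rather than an in-memory dictionary.
    lemma_cache = {}

    def expensive_lemmatize(token):
        return token.rstrip("s")  # stand-in for a costly NLP call

    def lemmatize(token):
        if token not in lemma_cache:  # only the first run pays the full cost
            lemma_cache[token] = expensive_lemmatize(token)
        return lemma_cache[token]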

Near Real-Time (NRT) updates in the Authoring environment

All business data updates made in the Management Center are first written to the database. This is followed by an incremental update event (along with the Authoring context) sent through Redis. Business logic built into the Apache NiFi indexing pipeline analyzes and processes these change requests.

Acting as the message bus, Redis broadcasts this change request to all NiFi connectors that monitor for this kind of operation. The result is incremental updates to the appropriate search indices. The data object cache and REST caches used in the Query Service for the Authoring environment are invalidated; JSP caching used in the Store server may also be invalidated, if applicable.
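
A minimal sketch of this message-bus pattern using the redis-py client; the channel name and payload are hypothetical, since the real event format is internal to HCL Commerce and NiFi:

    import json
    import redis

    r = redis.Redis(host="redis", port=6379)

    # Hypothetical change-request event; NiFi connectors subscribed to this
    # channel would pick it up and run the incremental indexing flow.
    r.publish(
        "nifi.index.update",  # hypothetical channel name
        json.dumps({"storeId": "1", "catentryId": "12345", "action": "update"}),
    )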

This approach can provide a near-real-time update experience via the Query service when previewing the storefront in the Authoring environment.

After NRT updates, you can check the status of the search index for a specific store by using the following endpoint:
http://<HOSTNAME>:<port>/search/resources/api/v2/data/status?storeId=1
A Swagger interface to this endpoint, V2-Data-Status, is available in the Query REST API.
Note: The envType parameter for this endpoint is optional. By default, its value is auth.
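
For example, a quick status check from Python (replace the host, port, and store ID with values for your environment):

    import requests

    HOST, PORT = "hostname", 80  # your Query service host and port

    resp = requests.get(
        f"http://{HOST}:{PORT}/search/resources/api/v2/data/status",
        params={"storeId": "1", "envType": "auth"},  # envType is optional; defaults to auth
    )
    print(resp.json())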

Dataload with Elasticsearch-based search

Configuration
Dataload needs to be configured to trigger incremental index updates. For more information, see Data Load utility.
Index update process

After the Dataload process updates the Commerce database with business data, the corresponding change history is written to the two TI_DELTA database tables, as is done in the corresponding Solr-based process. Once the load operation completes successfully, an event is sent through Redis to NiFi to launch an indexing flow.

The data object cache and REST caches used in the Query Service for the Authoring environment are invalidated. Any JSP caching used in the Store server may also be invalidated if applicable.

Note: Because Dataload runs as a non-interactive batch process and may contain a large number of data updates, the search index updates may take longer to complete, and the changes become visible only after cache invalidation takes place.