HCL Commerce Version 9.1.8.0 or later

Elasticsearch scaling and hardware requirements

You can achieve additional capacity by implementing or extending pod clustering on Elasticsearch, NiFi, or both. This topic also describes the hardware footprint and the key resources that affect processing and index creation speed.

Clustering

Elasticsearch
Elasticsearch is installed by default as a three-node cluster. Both the minimal and recommended sizings implement this clustering; the only difference is the resources that are allocated. The recommended sizing has more vCPUs and memory per pod, which is sufficient to drive traffic and build the index.

If required, you can add capacity by scaling horizontally, increasing the cluster size with additional Elasticsearch nodes (pods). Before scaling horizontally, however, it is recommended to optimize your infrastructure performance with faster storage, faster network interconnects, faster memory and CPUs, and larger memory and CPU allocations.
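
As a quick check after scaling, the Elasticsearch cluster health API reports the node count and overall status. The following minimal sketch uses the elasticsearch Python client; the endpoint URL is an assumption for illustration and should be replaced with your cluster's service address.

    # Minimal sketch: verify cluster size and health after horizontal scaling.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")  # hypothetical in-cluster endpoint

    health = es.cluster.health()
    print(f"status: {health['status']}")           # green / yellow / red
    print(f"nodes:  {health['number_of_nodes']}")  # should match your pod count
    print(f"relocating shards: {health['relocating_shards']}")  # non-zero while rebalancing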

NiFi
NiFi is configured as a single server in both the minimal and recommended configurations. For typical expected workloads, this is sufficient. However, if natural language processing (NLP) presents a bottleneck, NiFi horizontal clustering will improve NLP throughput with linear scalability.

Sharding

It is useful to know the optimal number of index shards to use as your data grows in production. You can determine this from the current size of the search index. Use the following three rules to decide when to adjust the number of index shards; a short calculation sketch follows the list.
  • An index shard should not exceed 40% of the total available storage of its cluster node.
  • An index shard should not exceed 50 GB in size; generally, the index performs best when each shard is smaller than 25 GB.
  • The document counts and sizes across shards should be similar.
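
The first two rules translate directly into arithmetic. The following Python sketch is for illustration only (the function and its defaults are not part of the product); it estimates a shard count from a measured index size and the storage available per node.

    import math

    def recommended_shard_count(index_size_gb: float,
                                node_storage_gb: float,
                                target_shard_gb: float = 25.0,
                                max_shard_gb: float = 50.0) -> int:
        """Estimate a shard count from the sizing rules above: each shard
        must stay under 50 GB and under 40% of its node's storage, and
        performance is generally best near 25 GB per shard."""
        # The hard per-shard cap is the stricter of the two limits.
        per_shard_cap = min(max_shard_gb, 0.40 * node_storage_gb)
        minimum = math.ceil(index_size_gb / per_shard_cap)
        # Aim for the ~25 GB sweet spot when that calls for more shards.
        preferred = math.ceil(index_size_gb / min(target_shard_gb, per_shard_cap))
        return max(minimum, preferred)

    # Example: a 120 GB index on nodes with 500 GB of storage each.
    # Cap = min(50 GB, 40% of 500 GB) = 50 GB; sweet spot asks for ceil(120 / 25) = 5.
    print(recommended_shard_count(120, 500))  # -> 5

The third rule is largely handled by Elasticsearch itself: default document routing hashes the document ID so that documents spread roughly evenly across shards. You can verify the balance with the _cat/shards API.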

Hardware footprint

Several factors determine the hardware footprint and the key resources that affect processing and index creation speed.

Resources influencing operations
The default installation of the NiFi and Elasticsearch e-commerce cluster assumes one NiFi pod, coupled with three clustered Elasticsearch pods. The key resources are CPU, RAM/heap sizes, and the I/O subsystem.

The minimum recommended CPU resources are set to 6 vCPUs per pod. This is shown to deliver acceptable performance when building catalogs of medium size (approximately 300,000 items). However, each catalog is different, and catalogs that have an excessively large attribute dictionary can require extra resources to keep up with the increased processing demand.

In general, NiFi processing will comfortably fit into the allocated CPU resource size, except in the case of NLP processing, which is typically CPU bound. Before increasing allocated CPU resources to boost NLP processing speed, it is recommended to re-test the index build on the existing hardware. During a second index build, NLP re-uses some of the computation from the initial run, which reduces the overall processing time dramatically. Use these repeated builds to derive any increased resource requirements.

More importantly, the heap sizes need to be adjusted to fit the size and complexity of the indexed data. The NiFi and Elasticsearch process is streamlined, and as long as the configuration is kept the same, the heap should be sufficient for a catalog of any size.

However, some adjustment will be required if additional optimization is attempted on larger catalogs (bigger flow files, more threads, etc.), or if the produced data set itself becomes larger (due to a larger number of attributes, for example).

In the provided minimal and recommended cases, there are two heap configurations, targeting the aforementioned 300,000-item and 1M-item catalogs. The tuning parameters differ between these configurations, because a 1M-item catalog requires larger heap sizes in both NiFi and Elasticsearch (12 GB and 16 GB, respectively).

The resource that most influences the overall operation of the NiFi and Elasticsearch coupling is the Elasticsearch I/O subsystem, which generally drives the overall processing speed. If Elasticsearch is slow to store the indexed data on disk, the whole build process slows down, including NiFi execution speed. Thus, the Elasticsearch file I/O subsystem must be considered early, and is ideally configured on local SSD/NVMe storage for maximum throughput and I/O rates.
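
To see whether heap or storage is the limiting factor during a build, the node statistics API exposes JVM heap usage and filesystem headroom for each node. A minimal sketch, again assuming a hypothetical endpoint:

    # Minimal sketch: spot-check heap pressure and disk headroom during a build.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")  # hypothetical endpoint

    stats = es.nodes.stats(metric="jvm,fs")
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        fs_total = node["fs"]["total"]
        free_gb = fs_total["available_in_bytes"] / 1024 ** 3
        total_gb = fs_total["total_in_bytes"] / 1024 ** 3
        print(f"{node['name']}: heap {heap_pct}% used, "
              f"disk {free_gb:.1f} of {total_gb:.1f} GB free")

Sustained heap usage near the configured maximum during a build is the signal to revisit the heap sizes described above, while shrinking disk headroom points back to the 40% rule in the Sharding section.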

Defined minimal and recommended hardware footprint
There are two primary deployment configurations for NiFi and Elasticsearch. The first is the default deployment configuration, also known as the minimal configuration; the second, known as the recommended configuration, has shown good results with larger catalogs of approximately 1M items.
The following table shows the resource values for the minimal and recommended hardware footprints:
Configuration                      Pods            vCPUs/ES pod   vCPUs/NiFi pod   ES heap (GB)   NiFi heap (GB)
Minimal configuration (default)    3 ES, 1 NiFi    6              6                12             9
Recommended configuration          3 ES, 1 NiFi    16             16               16             12