Tuesday, August 8, 2017

ElasticSearch Indexing Performance Tuning

Hi All,

Here I am going to share some tips for improving indexing performance in Elasticsearch. If you are doing indexing-heavy operations, these tips can help you improve performance to a great extent.

Before Performance Tuning

Before concluding that indexing is too slow, be sure that the cluster's hardware is fully utilized: use tools like iostat, top, and ps to confirm that CPU or I/O is saturated across all nodes. If it is not, the cluster can handle more concurrent requests; but if EsRejectedExecutionException is thrown from the Java client, or a TOO_MANY_REQUESTS (429) HTTP response is returned from REST requests, then there are too many concurrent requests.

Since the settings discussed here are focused on maximizing indexing throughput for a single shard, it is best to first test just a single node, with a single shard and no replicas, to measure what a single Lucene index is capable of on your documents, and iterate on tuning that, before scaling it out to the entire cluster. This can also give a baseline to roughly estimate how many nodes it will need in the full cluster to meet your indexing throughput requirements.

Once a single shard is working well, you can take full advantage of Elasticsearch's scalability and multiple nodes in your cluster by increasing the shard count and replica count.
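For example, a single-node baseline index with one shard and no replicas can be created like this (the index name my_index is just a placeholder):

```
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
```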

1.       Limit the number of analyzed fields in the mapping.

Analyzed fields are passed through an analyzer to convert the string into a list of individual terms before being indexed, which reduces indexing performance. The analysis process allows Elasticsearch to search for individual words within each full-text field. Analyzed fields are not used for sorting and are seldom used for aggregations.

(The string field is unsupported for indexes created in 5.x in favor of the text and keyword fields. Attempting to create a string field in an index created in 5.x will cause Elasticsearch to upgrade the string into the appropriate text or keyword field. text is an analyzed field, and keyword is a non-analyzed field.)
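As a sketch (index, type, and field names here are placeholders), a 5.x mapping can declare one analyzed and one non-analyzed field explicitly:

```
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title":  { "type": "text"    },
        "status": { "type": "keyword" }
      }
    }
  }
}
```

Searches on title will match individual words, while status will only match the exact stored value.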

2.       Disable merge throttling.

Merge throttling is Elasticsearch’s automatic tendency to throttle indexing requests when it detects that merging is falling behind indexing. It makes sense to update the cluster settings to disable merge throttling (by setting indices.store.throttle.type to “none”) if you need to optimize indexing performance rather than search. This setting can be made persistent (meaning it will persist after a cluster restart) or transient (resets back to the default upon restart), based on the use case.
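The cluster settings update described above can be sketched as follows (note that indices.store.throttle.type applies to older Elasticsearch versions; from 2.0 onward merge throttling is adaptive and this setting was removed):

```
PUT /_cluster/settings
{
  "transient": {
    "indices.store.throttle.type": "none"
  }
}
```

Use "persistent" instead of "transient" in the body if the setting should survive a cluster restart.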

3.      Increase the refresh interval

Increase the refresh interval via the index settings API. By default, the index refresh process occurs every second, but during heavy indexing periods, refreshing less frequently (or disabling refresh entirely by setting the interval to -1) can help alleviate some of the workload.
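For example, the refresh interval can be raised to 30 seconds on a hypothetical index (or set to "-1" to disable refresh entirely during a bulk load):

```
PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}
```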

4.       Increase translog flush threshold size

When a document is indexed in Elasticsearch, it is first written to a write-ahead log file called the translog. When the translog is flushed (by default, after every index, delete, update, or bulk request, or when the translog reaches a certain size, or after a time interval), Elasticsearch persists the data to disk with a Lucene commit, which is an expensive operation.
The translog helps prevent data loss in the event that a node fails. It is designed to help a shard recover operations that may otherwise have been lost between flushes.

Once the translog reaches the index.translog.flush_threshold_size, a flush will happen.

index.translog.flush_threshold_size can be increased from the default 512 MB to something larger, such as 1 GB, which allows larger segments to accumulate in the translog before a flush occurs. By letting larger segments build up, flushes happen less often, and the larger segments merge less often. All of this adds up to less disk I/O overhead and better indexing rates.
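A sketch of raising the threshold to 1 GB on a hypothetical index:

```
PUT /my_index/_settings
{
  "index": { "translog.flush_threshold_size": "1gb" }
}
```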

5.       Set the number of replicas to zero during ingestion

When documents are replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica will perform the analysis, indexing, and potentially merging process.
In contrast, if you index with zero replicas and then enable replicas when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer, which is much more efficient than duplicating the indexing process.
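A sketch of this workflow on a hypothetical index: drop replicas before the bulk load, then restore them afterwards:

```
PUT /my_index/_settings
{ "index": { "number_of_replicas": 0 } }

# ... run the bulk ingestion here ...

PUT /my_index/_settings
{ "index": { "number_of_replicas": 1 } }
```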

6.       Index IDs – Use auto-generated IDs

When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.

Note: This can improve the indexing performance greatly. 
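For example, a _bulk request whose action lines omit _id lets Elasticsearch auto-generate the document IDs (index and type names are placeholders):

```
POST /my_index/my_type/_bulk
{ "index": {} }
{ "user": "alice", "message": "first document" }
{ "index": {} }
{ "user": "bob", "message": "second document" }
```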

7.       The number of nodes

There is no hard and fast rule for determining the number of nodes required for a cluster. A good approach is to start with a single node and then increase the number of nodes until you get the expected performance.
A node is a single server/running instance that is part of the cluster, stores data, and participates in the cluster’s indexing and search capabilities.
Once a single node has reached its maximum performance (CPU, memory, I/O), a new node can be added to the cluster and the load balanced across the cluster. The Elasticsearch client balances the load against the nodes using a round-robin strategy; the Transport Client does this automatically.

8.       Tweak the VM Options – Increase heap size

By default, Elasticsearch tells the JVM to use a heap with a minimum and maximum size of 2 GB. When moving to production, it is important to configure heap size to ensure that Elasticsearch has enough heap available. Elasticsearch will assign the entire heap specified in jvm.options via the Xms (minimum heap size) and Xmx (maximum heap size) settings.

The value for these settings depends on the amount of RAM available on your server. Good rules of thumb are:
  • Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each other.
  • The more heap available to Elasticsearch, the more memory it can use for caching. But note that too much heap can subject you to long garbage collection pauses.
  • Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches. 
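For example, on a server with 16 GB of RAM, the heap could be pinned at 8 GB in jvm.options (an illustrative value, not a recommendation for every workload):

```
# jvm.options
-Xms8g
-Xmx8g
```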


9.       Bulk processor tuning

Bulk indexing requests should be used for optimal performance. The right bulk size depends on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size; document count is not a good metric for bulk size. For example, if 1,000 documents are indexed per bulk:
  • 1,000 documents at 1 KB each is 1 MB.
  • 1,000 documents at 100 KB each is 100 MB.
Those are drastically different bulk sizes. Bulks need to be loaded into memory at the coordinating node, so the physical size of the bulk matters more than the document count.
Start with a bulk size around 5–15 MB and slowly increase it until there is no more performance gain. Then start increasing the concurrency of the bulk ingestion (multiple threads, and so forth).
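The size-based batching above can be sketched in Python (the 10 MB target and the helper name batch_by_size are illustrative, not part of any Elasticsearch client API):

```python
# Sketch: group documents into bulks by physical payload size rather than
# document count. 10 MB is an illustrative target within the 5-15 MB range.
import json

TARGET_BULK_BYTES = 10 * 1024 * 1024  # ~10 MB

def batch_by_size(docs, max_bytes=TARGET_BULK_BYTES):
    """Yield lists of documents whose serialized size stays under max_bytes."""
    batch, batch_bytes = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and batch_bytes + doc_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch

# 1,000 small docs vs 1,000 large docs produce very different bulk counts.
small_docs = [{"id": i, "body": "x" * 1024} for i in range(1000)]    # ~1 KB each
large_docs = [{"id": i, "body": "x" * 102400} for i in range(1000)]  # ~100 KB each
small_bulks = list(batch_by_size(small_docs))  # fits in a single bulk
large_bulks = list(batch_by_size(large_docs))  # split across many bulks
```

Each yielded batch can then be sent as one _bulk request, and concurrency added on top once the single-threaded size is tuned.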


Hope that helps.
Thank You.

References:

https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/heap-size.html

