Here are some tips for improving indexing performance in Elasticsearch. If you are running indexing-heavy workloads, these can help you improve performance to a great extent.
Before Performance Tuning
Before concluding that indexing is too slow, be sure that the cluster's hardware is fully utilized: use tools like iostat, top and ps to confirm that CPU or I/O is saturated across all nodes. If it is not, the cluster can accept more concurrent requests; but if the Java client throws EsRejectedExecutionException, or REST requests return a TOO_MANY_REQUESTS (429) HTTP response, then there are already too many concurrent requests.
Since the settings discussed here are focused on maximizing indexing throughput for a single shard, it is best to first test just a single node, with a single shard and no replicas, to measure what a single Lucene index is capable of on your documents, and iterate on tuning that, before scaling it out to the entire cluster. This can also give a baseline to roughly estimate how many nodes it will need in the full cluster to meet your indexing throughput requirements.
Once a single shard is working well, you can take full advantage of Elasticsearch's scalability and the multiple nodes in your cluster by increasing the shard count and replica count.
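As a sketch of that baseline test, the index could be created with one primary shard and no replicas (the index name and the idea of sending this as a PUT body are illustrative):

```python
import json

# Illustrative settings body for a single-shard, zero-replica baseline
# index, used to measure what one Lucene index can do on your documents.
baseline_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
    }
}

# This would be the body of the PUT request that creates the index.
request_body = json.dumps(baseline_settings)
```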
1. Limit the number of analyzed fields
Analyzed fields are passed through an analyzer to convert the string into a list of individual terms before being indexed, which reduces indexing performance. The analysis process allows Elasticsearch to search for individual words within each full-text field. Analyzed fields are not used for sorting and are seldom used for aggregations.
(The string field is unsupported for indexes created in 5.x in favor of the text and keyword field types. Attempting to create a string field in an index created in 5.x will cause Elasticsearch to attempt to upgrade it to the appropriate text or keyword field. Text is an analyzed field type; keyword is not analyzed.)
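A minimal mapping sketch along these lines, with hypothetical field names: only the full-text field is analyzed, while the field used for sorting and aggregations is a keyword.

```python
# Hypothetical mapping: "description" is analyzed full text, while
# "status" is a keyword field (not analyzed), which is the kind of
# field you would use for sorting and aggregations.
mapping = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},    # analyzed on indexing
            "status": {"type": "keyword"},      # indexed as a single term
        }
    }
}
```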
2. Disable merge throttling
Merge throttling is Elasticsearch's automatic tendency to throttle indexing requests when it detects that merging is falling behind indexing. If the goal is to optimize indexing performance rather than search, it makes sense to update the cluster settings to disable merge throttling (by setting indices.store.throttle.type to "none"). This can be made persistent (meaning it will persist after a cluster restart) or transient (resets back to the default upon restart), based on the use case.
3. Increase the refresh interval
Increase the refresh interval via the index settings API. By default, the index refresh process occurs every second, but during heavy indexing periods, refreshing less frequently can help alleviate some of the workload.
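A sketch of the settings bodies for this (the "30s" value is illustrative; setting the interval to "-1" disables refreshing entirely):

```python
# Raise the refresh interval for a heavy ingest window, then restore
# the 1s default afterwards.
bulk_load_settings = {"index": {"refresh_interval": "30s"}}
restore_settings = {"index": {"refresh_interval": "1s"}}
```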
4. Increase the translog flush threshold size
When a document is indexed in Elasticsearch, it is first written to a write-ahead log file called the translog. When the translog is flushed (by default it is flushed after every index, delete, update, or bulk request, or when the translog reaches a certain size, or after a time interval), Elasticsearch persists the data to disk during a Lucene commit, which is an expensive operation.
The translog
helps prevent data loss in the event that a node fails. It is designed to help
a shard recover operations that may otherwise have been lost between flushes.
Once the translog hits the index.translog.flush_threshold_size, a flush occurs. index.translog.flush_threshold_size can be increased from the default 512 MB to something larger, such as 1 GB, which allows larger segments to accumulate in the translog before a flush occurs. By letting larger segments build, flushes happen less often, and the larger segments merge less often. All of this adds up to less disk I/O overhead and better indexing rates.
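As a sketch, the corresponding index-settings body would raise the threshold from the 512 MB default to 1 GB:

```python
# Index-settings body raising the translog flush threshold, so flushes
# (and the expensive Lucene commits) happen less often during indexing.
translog_settings = {"index": {"translog.flush_threshold_size": "1gb"}}
```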
5. Disable replicas during initial indexing
When documents are
replicated, the entire document is sent to the replica node and the indexing
process is repeated verbatim. This means each replica will perform the
analysis, indexing, and potentially merging process.
In contrast, if you index with zero replicas and then enable replicas when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer. This is much more efficient than duplicating the indexing process.
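The two steps can be sketched as settings bodies (restoring to 1 replica here is illustrative; use whatever replica count your cluster needs):

```python
# Drop replicas to zero for the bulk-ingest phase, then restore them
# so the finished shards are copied byte-for-byte to replica nodes.
ingest_settings = {"index": {"number_of_replicas": 0}}
after_ingest_settings = {"index": {"number_of_replicas": 1}}
```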
6. Use auto-generated document IDs
When indexing a
document that has an explicit id, Elasticsearch needs to check whether a
document with the same id already exists within the same shard, which is a
costly operation and gets even more costly as the index grows. By using
auto-generated ids, Elasticsearch can skip this check, which makes indexing
faster.
Note: This can improve the indexing performance greatly.
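A minimal sketch of a bulk request body that lets Elasticsearch auto-generate IDs: the action metadata line simply omits "_id" (the sample documents are illustrative).

```python
import json

docs = [{"title": "doc one"}, {"title": "doc two"}]

# Bulk body lines in NDJSON form: each action metadata line omits
# "_id", so Elasticsearch auto-generates IDs and skips the
# duplicate-ID existence check.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {}}))   # no "_id" supplied
    lines.append(json.dumps(doc))
bulk_body = "\n".join(lines) + "\n"           # trailing newline required
```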
7. The number of nodes
There is no hard and fast rule for determining the number of nodes required for a cluster. A good approach is to start with a single node and then increase the number of nodes until you get the expected performance.
A node is a
single server/running instance that is part of the cluster, stores data, and
participates in the cluster’s indexing and search capabilities.
Once a single node has reached its maximum performance (CPU, memory, I/O), a new node can be added to the cluster and the load can be balanced across the cluster. This is done through the Elasticsearch client using a round-robin strategy to balance the load across the nodes; the Transport Client does this automatically.
8. Tweak the JVM options – increase heap size
By default,
Elasticsearch tells the JVM to use a heap with a minimum and maximum size of 2
GB. When moving to production, it is important to configure heap size to ensure
that Elasticsearch has enough heap available. Elasticsearch will assign the
entire heap specified in jvm.options via the Xms (minimum heap size) and Xmx
(maximum heap size) settings.
The values for these settings depend on the amount of RAM available on your server. Good rules of thumb are:
- Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each other.
- The more heap available to Elasticsearch, the more memory it can use for caching. But note that too much heap can subject you to long garbage collection pauses.
- Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches, and ideally below about 32 GB so that the JVM can use compressed object pointers.
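The rules of thumb above can be sketched as a small helper (the function name is mine; the ~31 GB cap reflects the compressed-object-pointers guidance in the Elasticsearch heap-size docs):

```python
def recommended_heap_gb(ram_gb: float) -> float:
    # Rule of thumb: heap = 50% of physical RAM, capped at roughly
    # 31 GB to stay under the compressed-object-pointers cutoff.
    return min(ram_gb * 0.5, 31.0)

# The result would be set identically for Xms and Xmx in jvm.options,
# e.g. a 16 GB machine suggests -Xms8g / -Xmx8g.
```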
9. Bulk processor tuning
Bulk indexing
requests should be used for optimal performance. Bulk sizing is dependent
on data, analysis, and cluster configuration, but a good starting point is 5–15
MB per bulk. Note that this is physical size. Document count is not a good
metric for bulk size. For example, if 1,000 documents are indexed per bulk:
- 1,000 documents at 1 KB each is 1 MB.
- 1,000 documents at 100 KB each is 100 MB.
Those are
drastically different bulk sizes. Bulks need to be loaded into memory at the
coordinating node, so it is the physical size of the bulk that is more
important than the document count.
Start with a bulk size around 5–15 MB and slowly increase it until there is no further performance gain. Then start increasing the concurrency of the bulk ingestion (multiple threads, and so forth).
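Grouping by physical size rather than document count can be sketched like this (the helper name and 10 MB default are mine, sitting inside the 5–15 MB starting range):

```python
import json

def chunk_by_size(docs, max_bytes=10 * 1024 * 1024):
    """Group documents into bulks of roughly max_bytes serialized size."""
    bulks, current, current_bytes = [], [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if current and current_bytes + size > max_bytes:
            bulks.append(current)       # current bulk is full; start a new one
            current, current_bytes = [], 0
        current.append(doc)
        current_bytes += size
    if current:
        bulks.append(current)
    return bulks
```

Each resulting bulk would then become one bulk request, regardless of how many documents it happens to contain.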
Hope that helps.
Thank You.
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/heap-size.html