Tuesday, August 22, 2017

Abhi: JMockit Tutorial:Learn it today with examples


Tuesday, August 8, 2017

ElasticSearch Indexing Performance Tuning

Hi All,

Here I am going to provide you with some tips for improving indexing performance in ElasticSearch. If you are doing indexing-heavy operations, these tips can help you improve performance to a great extent.

Before Performance Tuning

Before concluding that indexing is too slow, be sure that the cluster's hardware is fully utilized: use tools like iostat, top, and ps to confirm that CPU or I/O is saturated across all nodes. If it is not, the cluster can handle more concurrent requests; but if an EsRejectedExecutionException is thrown by the Java client, or a TOO_MANY_REQUESTS (429) HTTP response is returned for REST requests, then there are too many concurrent requests.

Since the settings discussed here are focused on maximizing indexing throughput for a single shard, it is best to first test just a single node, with a single shard and no replicas, to measure what a single Lucene index is capable of on your documents, and iterate on tuning that, before scaling it out to the entire cluster. This can also give a baseline to roughly estimate how many nodes it will need in the full cluster to meet your indexing throughput requirements.

Once a single shard is working well, you can take full advantage of Elasticsearch's scalability and the multiple nodes in your cluster by increasing the shard count and replica count.

1.       Limit the number of analyzed fields in the mapping.

Analyzed fields are passed through an analyzer, which converts the string into a list of individual terms before indexing. This slows down indexing. The analysis process is what allows Elasticsearch to search for individual words within each full-text field, but analyzed fields are not used for sorting and are seldom used for aggregations.

(The string field is unsupported in indexes created in 5.x, in favor of the text and keyword fields. Attempting to create a string field in an index created in 5.x will cause Elasticsearch to upgrade it to the appropriate text or keyword field. Text is an analyzed field; keyword is not analyzed.)

2.       Disable merge throttling.

Merge throttling is Elasticsearch’s automatic tendency to throttle indexing requests when it detects that merging is falling behind indexing. It makes sense to update the cluster settings to disable merge throttling (by setting indices.store.throttle.type to “none”) when you are optimizing for indexing performance rather than search. This change can be made persistent (meaning it will persist after a cluster restart) or transient (resets back to the default upon restart), depending on the use case.
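As a minimal sketch, assuming a node reachable at localhost:9200 and a version that still supports this setting, the transient form of the change looks like:

```shell
# Disable merge throttling cluster-wide. "transient" means the setting
# resets to its default after a full cluster restart; use "persistent"
# instead to keep it across restarts.
curl -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{ "transient": { "indices.store.throttle.type": "none" } }'
```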

3.       Increase the refresh interval

Increase the refresh interval via the Index Settings API. By default, the index refresh process occurs every second, but during heavy indexing periods, increasing the interval (and thereby reducing the refresh frequency) can relieve some of the workload. Setting it to -1 disables refreshing entirely.
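As a sketch, assuming an index named my_index on a local node (both names are examples), the interval can be raised during a heavy load and restored afterwards:

```shell
# Raise the refresh interval from the default 1s to 30s during bulk loading;
# a value of -1 would disable refreshing entirely.
curl -XPUT 'http://localhost:9200/my_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "30s" } }'

# Restore the default once the heavy indexing is done.
curl -XPUT 'http://localhost:9200/my_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "1s" } }'
```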

4.       Increase translog flush threshold size

When a document is indexed in Elasticsearch, it is first written to a write-ahead log file called the translog. When the translog is flushed (by default after every index, delete, update, or bulk request, when the translog reaches a certain size, or after a time interval), Elasticsearch persists the data to disk in a Lucene commit, which is an expensive operation.
The translog helps prevent data loss in the event that a node fails. It is designed to help a shard recover operations that might otherwise have been lost between flushes.

Once the translog reaches the index.translog.flush_threshold_size size, a flush is triggered.

index.translog.flush_threshold_size can be increased from the default 512 MB to something larger, such as 1 GB, which allows larger segments to accumulate in the translog before a flush occurs. By letting larger segments build up, flushes happen less often, and the larger segments merge less often. All of this adds up to less disk I/O overhead and better indexing rates.
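A sketch of the settings call, again assuming a local node and an example index name:

```shell
# Raise the translog flush threshold from the default 512mb to 1gb
# ("my_index" is an example index name).
curl -XPUT 'http://localhost:9200/my_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "translog": { "flush_threshold_size": "1gb" } } }'
```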

5.       Set the number of replicas to zero during ingestion

When documents are replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica performs the analysis, indexing, and potentially merging work as well.
In contrast, if you index with zero replicas and enable replicas only when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer. This is much more efficient than duplicating the indexing process.
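As a sketch (index name and local endpoint are examples):

```shell
# Drop replicas before heavy ingestion...
curl -XPUT 'http://localhost:9200/my_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 0 } }'

# ...and restore them once ingestion is finished; recovery is then a
# byte-for-byte segment copy rather than a repeat of the indexing work.
curl -XPUT 'http://localhost:9200/my_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 1 } }'
```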

6.       Use auto-generated document IDs

When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.

Note: This can improve the indexing performance greatly. 
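The difference between the two styles can be sketched as follows (index and type names are examples; the blog targets the 5.x era, where types still exist):

```shell
# POST without an id: Elasticsearch generates one and can skip the
# duplicate-id lookup.
curl -XPOST 'http://localhost:9200/my_index/my_type' \
  -H 'Content-Type: application/json' \
  -d '{ "message": "auto-generated id" }'

# PUT with an explicit id: forces a check for an existing document first,
# which gets more expensive as the index grows.
curl -XPUT 'http://localhost:9200/my_index/my_type/1' \
  -H 'Content-Type: application/json' \
  -d '{ "message": "explicit id" }'
```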

7.       The number of nodes

There is no hard and fast rule for determining the number of nodes required for a cluster. A good approach is to start with a single node and then increase the number of nodes until you get the expected performance.
A node is a single server/running instance that is part of the cluster, stores data, and participates in the cluster’s indexing and search capabilities.
Once a single node has reached its maximum performance (CPU, memory, I/O), a new node can be added to the cluster and the load can be balanced across the cluster. This is done through the Elasticsearch client using a round-robin strategy to balance the load across the nodes; the Transport Client does this automatically.

8.       Tweak the VM Options – Increase heap size

By default, Elasticsearch tells the JVM to use a heap with a minimum and maximum size of 2 GB. When moving to production, it is important to configure heap size to ensure that Elasticsearch has enough heap available. Elasticsearch will assign the entire heap specified in jvm.options via the Xms (minimum heap size) and Xmx (maximum heap size) settings.

The value for these settings depends on the amount of RAM available on your server. Good rules of thumb are:
  • Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each other.
  • The more heap available to Elasticsearch, the more memory it can use for caching. But note that too much heap can subject you to long garbage collection pauses.
  • Set Xmx to no more than 50% of your physical RAM, to ensure that there is enough physical RAM left for kernel file system caches. 
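As a sketch, on a server with 16 GB of RAM (the 8g figure is an example following the 50% rule above), the heap can be set either in jvm.options or via the environment:

```shell
# Option 1: edit config/jvm.options in the Elasticsearch installation:
#   -Xms8g
#   -Xmx8g
# Option 2: set the heap via the environment before starting the node.
export ES_JAVA_OPTS="-Xms8g -Xmx8g"
```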


9.       Bulk processor tuning

Bulk indexing requests should be used for optimal performance. The best bulk size depends on the data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size; document count is not a good metric for bulk size. For example, if 1,000 documents are indexed per bulk:
·         1,000 documents at 1 KB each is 1 MB.
·         1,000 documents at 100 KB each is 100 MB.
Those are drastically different bulk sizes. Bulks need to be loaded into memory on the coordinating node, so it is the physical size of the bulk that matters more than the document count.
Start with a bulk size around 5–15 MB and slowly increase it until there is no further performance gain. Then start increasing the concurrency of the bulk ingestion (multiple threads, and so forth).
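A minimal two-document bulk request looks like this (index and type names are examples):

```shell
# The _bulk body is newline-delimited JSON: an action line followed by a
# source line for each document, with a trailing newline at the end.
curl -XPOST 'http://localhost:9200/my_index/my_type/_bulk' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary $'{ "index": {} }\n{ "user": "alice" }\n{ "index": {} }\n{ "user": "bob" }\n'
```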


Hope that helps.
Thank You.

References : 

https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://www.elastic.co/guide/en/elasticsearch/reference/master/heap-size.html


Monday, March 20, 2017

Develop Your First REST Web Service with Java Spring

Hi All,

Let's develop our first REST web service. There are various libraries and methods you can follow to develop a REST API, but here we are going to use the Spring framework to get this done in a much easier manner. I am using IntelliJ IDEA 2016 to develop the application.

First let's go and create a Maven project in IDEA. We are not going to select an archetype; we will just use the existing template.

Let's update the pom.xml. We need to specify our Spring dependencies and also the Tomcat plugin to run our service locally. We will also need to add Jackson to do the JSON mapping for us. Once you have updated the pom file, it should look like below.


Once the dependencies are added to the project, you can check that by going into the project structure. 

Then we need to specify the webapp folder where the necessary web resources are present. It's important that we adhere to the given structure; otherwise, during the project build and packaging, we would need to edit the pom to point to where our resources actually reside. The project structure should look like below.

Now let's create a package and add our REST controller class inside it. The REST controller acts as an endpoint for requests: it receives requests such as GET and POST and, according to the request, replies with the necessary output. To mark this class as a controller we use the '@RestController' annotation. Then we can optionally use '@RequestMapping' to specify the path to our controller from the root. In order to include the header for CORS (Cross-Origin Resource Sharing) we use the '@CrossOrigin' annotation. To read more about CORS you can refer to this link


Now we have set up our REST controller. Let's add some REST methods to get some work done with the controller. There are several REST methods, such as GET, POST, PUT, PATCH, and DELETE. You can read about the REST methods here

Before adding the methods, let's create a simple Person class and a list of people inside our controller, just to test both the GET and POST methods.


Now we can go and create our methods. Following is the complete code for the controller class.


We also need to add the servlet.xml and web.xml inside WEB-INF folder. 


Now let's run our program first and then look into the methods and what we have done in each. 

In order to run the program we can create a new debug configuration in IntelliJ as follows. Then you can click on the debug button to debug the program.

In order to check the results it's good to install the Postman plugin for Chrome.

Now let's check our results while referring to the methods.

The first method is a GET method that creates a Person and returns the object as the response. The Person object is converted to a JSON object automatically by Jackson data binding. The method is just a simple GET method without any arguments. You can access the web service at the following URI.

http://localhost:8080/JavaSpringRESTDemo/learning/newperson

If you check the result through Postman, it will be as follows.
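If you prefer the command line, a curl call equivalent to the Postman check (against the URI given above) would be:

```shell
# GET the newly created Person; Jackson serializes it to JSON in the response.
curl -X GET 'http://localhost:8080/JavaSpringRESTDemo/learning/newperson'
```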


The second method, addPerson, is a POST method that accepts a Person as a JSON object and adds it to a list. Here the Person object is again sent as a JSON object and mapped to a POJO by Jackson data binding. You can send the POST request as follows.

Once you send the request, the response will be the existing list with the added Person, as a JSON object.
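A curl equivalent of the Postman POST would look something like this; note that the /addperson path and the Person fields (id, name) are assumptions here, since the actual controller code is in the screenshots:

```shell
# POST a Person as JSON; Jackson maps the body onto the Person POJO.
curl -X POST 'http://localhost:8080/JavaSpringRESTDemo/learning/addperson' \
  -H 'Content-Type: application/json' \
  -d '{ "id": 15, "name": "John" }'
```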

Now let's use a URL parameter to get a person when the user ID is specified. The URL parameter is specified by the annotation @RequestParam(value = "id"). So we have to specify the URL as follows to get the output object.

There is another way to specify a parameter, in the URL path itself: by using a path variable. The final method is developed to accept a path variable containing a user ID. You can use the following URL to get the person with ID 15.
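For illustration, the two styles can be called as below; the /getperson and /person paths are assumptions standing in for the actual mappings shown in the screenshots:

```shell
# @RequestParam style: the id arrives as a query-string parameter.
curl 'http://localhost:8080/JavaSpringRESTDemo/learning/getperson?id=15'

# @PathVariable style: the id is part of the URL path itself.
curl 'http://localhost:8080/JavaSpringRESTDemo/learning/person/15'
```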

That should be it for now. Hope to publish more Spring-related stuff.

Hope that helps. 
Thank you.

Sunday, March 19, 2017

Communicate Within Wars Inside Same Container

Hi All,

There are certain scenarios where you want to communicate between the wars inside the same container without using the network. For example, you may need to avoid the network delay that would be caused by using web services, RMI, or HTTP.

So what I'm going to show you is one working solution to the above problem. What we can do is introduce a common library that both services depend on. Through this intermediate library we can carry out the communication between the war files.

For this example, I'm using a service called 'Front Service' to accept the user request, and a second service called 'Ground Service' that contains a method which needs to be called by the Front Service. The communication between the services is done through a JAR named 'Common-Lib'.

Common-Lib has an interface that mirrors the Ground Service: it contains the method definitions of the Ground Service.



Common-Lib also has a class that gets and sets the service instances registered to the jar.



Then let's develop the Front Service. This service is a simple Spring service that gets a message from the Ground Service.



Finally we need to implement the Ground Service.
Here we have the Ground Service class and the helper classes that map a service instance to the Service Handler. There is a class named GroundServiceConfig that sets an instance of the IGroundService interface on the Service Handler. IGroundService is implemented in the GroundServiceAdapter class.







Now we have the two services and the common library defined. It's important to note that although common-lib is a dependency of both services, we do not bundle it with the services. Therefore, in the pom.xml, we specify the 'common-lib' dependency as provided.



Therefore we need to deploy common-lib in the 'lib' directory inside the Tomcat installation.
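As a sketch, the provided-scope dependency in each service's pom.xml would look like the fragment below (groupId and version are placeholders); at runtime the single copy of the jar in Tomcat's lib/ directory is used instead of a bundled one:

```shell
# Display the pom.xml fragment that keeps common-lib out of each war.
cat <<'POM'
<dependency>
  <groupId>com.example</groupId>      <!-- placeholder -->
  <artifactId>common-lib</artifactId>
  <version>1.0</version>              <!-- placeholder -->
  <scope>provided</scope>
</dependency>
POM
```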

What happens here is that when the Ground Service is initializing, it assigns a new instance of IGroundService to the Service Handler. This instance can then be used to call the service methods from the other services that use 'Common-Lib' as a dependency.

Now let's deploy the two services inside webapps and put common-lib inside 'lib'.

Let's go to the Front Service URL and check the result. As you can see, we have received the message successfully from the Ground Service.

The complete projects and code can be found here.

Hope that helps.
Thank You.