
ElasticSearch retrieves documents slowly

I'm using the Java API to retrieve records from Elasticsearch. It takes approximately 5 seconds to retrieve 100,000 documents (records/rows) in a Java application.

Is that slow for Elasticsearch, or is it normal?

Here are the index settings:

[image of the index settings]

I tried to improve performance, but without success. Here is what I did:

  • Increased the Elasticsearch heap from 1 GB (the default) to 3 GB: -Xms3g -Xmx3g

  • Moved Elasticsearch from a 7200 RPM hard drive to an SSD

  • Retrieved only one field instead of 30

Here is my Java implementation:

private void getDocuments() {
        int counter = 1;
        try {
            lgg.info("started");
            TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
                    .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));

            SearchResponse scrollResp = client.prepareSearch("ebpp_payments_union").setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                    .setQuery(QueryBuilders.matchAllQuery())                 
                    .setScroll(new TimeValue(1000))
                    .setFetchSource(new String[] { "payment_id" }, null)
                    .setSize(10000)
                    .get();

            do {
                for (SearchHit hit : scrollResp.getHits().getHits()) {
                    if (counter % 100000 == 0) {
                        lgg.info(counter + "--" + hit.getSourceAsString());
                    }
                    counter++;
                }

                scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                        .setScroll(new TimeValue(60000))
                        .execute()
                        .actionGet();
            } while (scrollResp.getHits().getHits().length != 0);

            client.close();
        } catch (UnknownHostException e) {
            e.printStackTrace();
        }
    }

I know that TransportClient is deprecated. I also tried RestHighLevelClient, but it does not change anything.

Do you know how to get better performance?

Should I change something in ElasticSearch or modify my Java code?

Performance troubleshooting and tuning are hard to do without understanding everything involved, but that does not seem very fast. Because this is a single-node cluster, you're going to run into some performance issues. If this were a production cluster, you would have at least one replica for each shard, and replicas can also serve reads.

A few other things you can do:

  • Index your documents with routing based on your most frequently searched attribute. This writes all documents with the same attribute value to the same shard, so ES does less work when reading. (This won't help you, since you have a single shard.)
  • Add multiple replica shards so you can fan out the reads across nodes in the cluster. (Once again, you need to actually have a cluster.)
  • Don't put the master role on the same boxes as your data. In a moderate or large cluster, you should also have boxes that are neither master nor data; your app connects to these so they can manage the meta work for the searches and let the data nodes focus on data.
  • Use "query_then_fetch" unless you are using weighted searches, in which case you should probably stick with DFS.
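The last point can be tried without touching the cluster. A minimal sketch, assuming Java 11+ and the REST endpoint rather than the TransportClient used in the question: it builds (but does not send) the initial scroll search with search_type=query_then_fetch. The class name, method name, and parameters are hypothetical, and the page size of 1000 is only a placeholder.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class QueryThenFetchScroll {
    // Builds the initial scroll search request without sending it.
    // baseUrl, index and keepAlive are illustrative parameters, not taken from the question.
    static HttpRequest initialRequest(String baseUrl, String index, String keepAlive) {
        URI uri = URI.create(baseUrl + "/" + index
                + "/_search?scroll=" + keepAlive
                + "&search_type=query_then_fetch");
        // match_all needs no cross-shard term statistics, so DFS adds nothing here.
        String body = "{\"size\":1000,\"query\":{\"match_all\":{}}}";
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

With the Java client from the question, the equivalent change would be to replace SearchType.DFS_QUERY_THEN_FETCH with SearchType.QUERY_THEN_FETCH, or to drop the setSearchType call, since query_then_fetch is the default.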

I see three possible axes for optimization:

1/ Sort your documents on the _doc key:

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

(documentation source)
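As a rough illustration of point 1, the sketch below builds the JSON bodies for a scroll over the REST API with the sort set to _doc. It uses plain request bodies instead of the question's TransportClient so that it stays dependency-free. The class and method names are hypothetical, and the scroll ID is assumed to contain no characters that need JSON escaping.

```java
final class DocSortedScroll {
    // Body for the initial POST /<index>/_search?scroll=<keepAlive> request:
    // match everything and return hits in index order (_doc).
    static String initialBody(int pageSize) {
        return "{\"size\":" + pageSize
                + ",\"query\":{\"match_all\":{}}"
                + ",\"sort\":[\"_doc\"]}";
    }

    // Body for each follow-up POST /_search/scroll request.
    // Assumes scrollId needs no JSON escaping.
    static String nextPageBody(String scrollId, String keepAlive) {
        return "{\"scroll\":\"" + keepAlive + "\",\"scroll_id\":\"" + scrollId + "\"}";
    }
}
```

With the Java client from the question, the corresponding change should be adding .addSort("_doc", SortOrder.ASC) to the search request builder.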

2/ Reduce your page size; 10000 seems a high value. Can you run different tests with reduced values such as 5000 or 1000?
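One way to run that comparison is a small timing harness. This is only a sketch: fullScroll is a hypothetical callback that is assumed to perform one complete scroll with the given page size (for example, the getDocuments method from the question with the size passed in as a parameter).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

final class PageSizeTimer {
    // Runs fullScroll once per page size and records the elapsed time in milliseconds.
    // Results keep the order in which the page sizes were given.
    static Map<Integer, Long> time(int[] pageSizes, IntConsumer fullScroll) {
        Map<Integer, Long> elapsedMs = new LinkedHashMap<>();
        for (int size : pageSizes) {
            long start = System.nanoTime();
            fullScroll.accept(size); // hypothetical: scrolls the whole index with this page size
            elapsedMs.put(size, (System.nanoTime() - start) / 1_000_000);
        }
        return elapsedMs;
    }
}
```

A call such as PageSizeTimer.time(new int[] {10000, 5000, 1000}, size -> runScroll(size)) would give one timing per page size, where runScroll is again a placeholder. A single run is noisy, so repeating the measurement a few times and ignoring the first pass is probably wise.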

3/ Remove the source filtering

.setFetchSource(new String[] { "payment_id" }, null)

Source filtering can be expensive: the Elasticsearch node has to read the source, parse it into an object, and then filter it. Can you try removing it? The network load will increase, but it's a trade-off :)
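To compare the two variants side by side, the sketch below builds the REST search body with and without a _source filter. It is an illustration under assumptions, not the client code from the question: the class and method names are hypothetical, and field names are assumed to need no JSON escaping.

```java
final class SourceOptions {
    // sourceFields == null: no filtering, the full _source is returned.
    // Otherwise only the listed fields are returned.
    static String body(int pageSize, String[] sourceFields) {
        StringBuilder sb = new StringBuilder("{\"size\":").append(pageSize)
                .append(",\"query\":{\"match_all\":{}}");
        if (sourceFields != null) {
            sb.append(",\"_source\":[");
            for (int i = 0; i < sourceFields.length; i++) {
                if (i > 0) sb.append(',');
                sb.append('"').append(sourceFields[i]).append('"');
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }
}
```

With the Java client from the question, the unfiltered variant corresponds to simply removing the setFetchSource call. Timing both variants on the same data is the only reliable way to see which side of the trade-off wins.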
