ElasticSearch retrieves documents slowly

Question

I'm using the Java API to retrieve records from ElasticSearch. It takes approximately 5 seconds to retrieve 100000 documents (records/rows) in my Java application.

Is that slow for ElasticSearch, or is it normal?

Here are the index settings:

I tried to get better performance, but without success. Here is what I did:

Set the ElasticSearch heap space to 3GB (-Xms3g -Xmx3g); it was 1GB (the default)
Migrated ElasticSearch from a 7200 RPM hard drive to an SSD
Retrieved only one field instead of 30

Here is my Java implementation code:

private void getDocuments() {
        int counter = 1;
        try {
            lgg.info("started");
            TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
                    .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));

            SearchResponse scrollResp = client.prepareSearch("ebpp_payments_union").setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                    .setQuery(QueryBuilders.matchAllQuery())                 
                    .setScroll(new TimeValue(1000))
                    .setFetchSource(new String[] { "payment_id" }, null)
                    .setSize(10000)
                    .get();

            do {
                for (SearchHit hit : scrollResp.getHits().getHits()) {
                    if (counter % 100000 == 0) {
                        lgg.info(counter + "--" + hit.getSourceAsString());
                    }
                    counter++;
                }

                scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                        .setScroll(new TimeValue(60000))
                        .execute()
                        .actionGet();
            } while (scrollResp.getHits().getHits().length != 0);

            client.close();
        } catch (UnknownHostException e) {
            e.printStackTrace();
        }
    }

I know that TransportClient is deprecated. I also tried RestHighLevelClient, but it does not change anything.

Do you know how to get better performance?

Should I change something in ElasticSearch or modify my Java code?

Answer 1

Performance troubleshooting and tuning are hard to do without understanding everything involved, but that does not seem very fast. Because this is a single-node cluster, you are going to run into some performance issues. If this were a production cluster, you would have at least one replica for each shard, which can also be used for reading.
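To put the numbers from the question in perspective, here is a small arithmetic sketch. It assumes the 5 seconds are spread evenly over the scroll pages and ignores the final empty scroll call; the class and method names are made up for illustration.

```java
// Back-of-the-envelope figures for the scroll described in the question.
class ScrollThroughput {

    // Average number of documents retrieved per second.
    static double docsPerSecond(long docs, double seconds) {
        return docs / seconds;
    }

    // Number of non-empty scroll pages needed (rounded up).
    static long pages(long docs, int pageSize) {
        return (docs + pageSize - 1) / pageSize;
    }

    // Average time spent per page, assuming an even spread.
    static double millisPerPage(long docs, int pageSize, double seconds) {
        return seconds * 1000.0 / pages(docs, pageSize);
    }

    public static void main(String[] args) {
        long docs = 100000;      // from the question
        int pageSize = 10000;    // setSize(10000) in the question's code
        double seconds = 5.0;    // from the question
        System.out.println("docs/s:  " + docsPerSecond(docs, seconds));
        System.out.println("pages:   " + pages(docs, pageSize));
        System.out.println("ms/page: " + millisPerPage(docs, pageSize, seconds));
    }
}
```

That works out to about 20000 documents per second and roughly 500 ms per page of 10000 documents, which gives a baseline to compare against after each change.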

A few other things you can do:

Index your documents based on your most frequently searched attribute. This writes all of the documents with the same attribute to the same shard, so ES does less work when reading. (This won't help you, since you have a single shard.)
Add multiple replica shards so you can fan out the reads across nodes in the cluster (once again, you need to actually have a cluster).
Don't put the master role on the same boxes as your data. If you have a moderate or large cluster, you should have boxes that are neither master nor data but are the boxes your app connects to, so they can manage the meta work for the searches and let the data nodes focus on data.
Use "query_then_fetch", unless you are using weighted searches, in which case you should probably stick with DFS.
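As a rough illustration of the last point, the search type can be chosen per request. The sketch below only builds the search URI with the `search_type` URL parameter of the REST API, so that it needs nothing beyond the JDK. The host, the port 9200, the `scroll=1m` keep-alive and the class name are assumptions, not taken from the question.

```java
import java.net.URI;

// Sketch: choose the search type through the REST "search_type" URL parameter.
class SearchTypeUri {

    // The two values the answer refers to.
    enum SearchType {
        QUERY_THEN_FETCH("query_then_fetch"),
        DFS_QUERY_THEN_FETCH("dfs_query_then_fetch");

        final String param;

        SearchType(String param) {
            this.param = param;
        }
    }

    // Builds the URI of the initial scroll search for the given index.
    static URI searchUri(String baseUrl, String index, SearchType type) {
        return URI.create(baseUrl + "/" + index
                + "/_search?scroll=1m&search_type=" + type.param);
    }

    public static void main(String[] args) {
        // Assumed local node; the index name is the one from the question.
        System.out.println(searchUri("http://localhost:9200",
                "ebpp_payments_union", SearchType.QUERY_THEN_FETCH));
    }
}
```

In the Java API code from the question, the equivalent change would be passing `SearchType.QUERY_THEN_FETCH` to `setSearchType` instead of `SearchType.DFS_QUERY_THEN_FETCH`.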

Answer 2

I see three possible axes for optimization:

1/ Sort your documents on the _doc key:

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

(documentation source)

2/ Reduce your page size; 10000 seems a high value. Can you run different tests with reduced values like 5000 or 1000?
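A minimal sketch of such a request, written against the REST API so that it needs only the JDK (Java 11+). It builds the initial scroll request but does not send it. The endpoint `http://localhost:9200`, the `1m` keep-alive and the class name are assumptions; the index and field names come from the question.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) the initial scroll request.
class DocSortedScrollRequest {

    // Request body: match every document, sort by _doc, return only payment_id.
    static String body(int pageSize) {
        return "{"
                + "\"size\":" + pageSize + ","
                + "\"sort\":[\"_doc\"],"
                + "\"_source\":[\"payment_id\"],"
                + "\"query\":{\"match_all\":{}}"
                + "}";
    }

    // POST <baseUrl>/<index>/_search?scroll=1m with the JSON body above.
    static HttpRequest build(String baseUrl, String index, int pageSize) {
        URI uri = URI.create(baseUrl + "/" + index + "/_search?scroll=1m");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(pageSize)))
                .build();
    }

    public static void main(String[] args) {
        // Assumed local node listening on the default HTTP port.
        HttpRequest request = build("http://localhost:9200", "ebpp_payments_union", 1000);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body(1000));
    }
}
```

With the Java API used in the question, the corresponding builder call should be `addSort("_doc", SortOrder.ASC)` on the search request.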
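One way to run such a comparison is a small timing harness like the sketch below. `PageSource` is a hypothetical abstraction over "give me the next scroll page", and `inMemory` is only a stand-in so the example runs without a cluster; in a real test it would be replaced by a source that wraps the scroll calls from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a harness that times a full scroll for several page sizes.
class PageSizeTrial {

    // Hypothetical abstraction: returns successive pages, then an empty list.
    interface PageSource {
        List<String> nextPage();
    }

    // Reads pages until an empty one arrives and returns the document count.
    static long drain(PageSource source) {
        long count = 0;
        for (List<String> page = source.nextPage(); !page.isEmpty(); page = source.nextPage()) {
            count += page.size();
        }
        return count;
    }

    // In-memory stand-in for Elasticsearch, used only to make the sketch runnable.
    static PageSource inMemory(int totalDocs, int pageSize) {
        return new PageSource() {
            private int next = 0;

            @Override
            public List<String> nextPage() {
                List<String> page = new ArrayList<>();
                while (next < totalDocs && page.size() < pageSize) {
                    page.add("doc-" + next++);
                }
                return page;
            }
        };
    }

    public static void main(String[] args) {
        // Page sizes suggested in the answer.
        for (int pageSize : new int[] {10000, 5000, 1000}) {
            long start = System.nanoTime();
            long docs = drain(inMemory(100000, pageSize));
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("size=" + pageSize + " docs=" + docs + " ms=" + millis);
        }
    }
}
```

Running each page size several times and discarding the first run helps separate JVM and cache warm-up from the effect of the page size itself.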

3/ Remove the source filtering

.setFetchSource(new String[] { "payment_id" }, null)

Source filtering can be heavy, since the Elasticsearch node needs to read the source, transform it into an object, and then filter it. So can you try to remove this? The network load will increase, but it's a trade-off :)

Answer 1
1 2019-07-11 14:07:24

Answer 2
1 2019-07-15 12:39:03