
Elasticsearch gc overhead message when trying to execute Boolean OR query retrieving all the matching documents at once

I have been using Elasticsearch 7.6 with the PHP client API for all operations. I created the Elasticsearch index settings and mappings as follows:

$params = [
    'index' => $index,
    'body' => [
        'settings' => [
            "number_of_shards" => 1,
            "number_of_replicas" => 0,
            "index.queries.cache.enabled" => false,
            "index.soft_deletes.enabled" => false,
            "index.refresh_interval" => -1,
            "index.requests.cache.enable" => false,
            "index.max_result_window" => 2000000
        ],
        'mappings' => [
            '_source' => [
                "enabled" => false
            ],
            'properties' => [
                "text" => [
                    "type" => "text",
                    "index_options" => "docs"
                ]
            ]
        ]
    ]
];

My Boolean OR search query is as follows:

$json = '{
    "from": 0, "size": 2000000,
    "query": {
        "bool": {
            "filter": {
                "match": {
                    "text": {
                        "query": "apple orange grape banana",
                        "operator": "or"
                    }
                }
            }
        }
    }
}';

I have indexed 2 million documents in such a way that all of them match the query, and I am getting all the documents back as expected. Since every document matches, I avoided scoring by using a filter clause in the bool query.

But in my log file I repeatedly get the following message until the query finishes executing. Sometimes I got the same message when indexing the documents in bulk:

[2020-05-15T19:15:45,720][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][14] overhead, spent [393ms] collecting in the last [1.1s]
[2020-05-15T19:15:47,822][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][16] overhead, spent [399ms] collecting in the last [1s]
[2020-05-15T19:15:49,827][INFO ][o.e.m.j.JvmGcMonitorService] [node1] [gc][18] overhead, spent [308ms] collecting in the last [1s]

I have given 16 GB of heap memory. No other warnings are shown in the Elasticsearch log. What could be the reason for it? Or is it expected when retrieving a huge number of documents? I know about the scroll API, but I am curious why this happens when I use a large value for index.max_result_window. Any help is much appreciated. Thanks in advance!

What you see is normal behaviour for Elasticsearch with said configuration in particular, and for any Java application in general.

Is it normal for ES with a big index.max_result_window?

Yes. As the docs on index.max_result_window state, the amount of garbage generated is proportional to the number of documents returned by the query:

Search requests take heap memory and time proportional to from + size and this limits that memory.

Does it also apply to bulk API requests?

Yes, if your bulk request is large, it might trigger garbage collection.
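For instance, rather than sending one huge request, the same documents can be sent as several smaller _bulk bodies; each request then allocates, and frees, a smaller chunk of heap at a time. A minimal sketch of one such batch (the index name myindex is a placeholder; the body is NDJSON, one action line followed by one document line):

```json
POST /_bulk
{ "index": { "_index": "myindex" } }
{ "text": "apple orange grape banana" }
{ "index": { "_index": "myindex" } }
{ "text": "banana grape" }
```

The elasticsearch-php client exposes the same operation; the point is simply to cap the batch size rather than index all 2 million documents in one request.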

Naturally, ES allocates the documents it needs to send back to the user on the heap; immediately after that they become garbage and are thus subject to garbage collection.

How does garbage collection work in Java?

You may find some relevant information, for example, here.

Is there a better way to query for all matching documents?

There is, for example, the match_all query.
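A match_all version of the query above might look like this (the size value mirrors the original query and is only illustrative):

```json
{
    "from": 0,
    "size": 2000000,
    "query": {
        "match_all": {}
    }
}
```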

How is it better than making all documents match a certain query? Elasticsearch does not have to query the index and can fetch the documents right away (better performance and resource use).

Should I use the scroll API, or is the current approach good enough?

The scroll API is the recommended way, since it scales far beyond the memory capacity of a single Elasticsearch node (one can download 1 TB of data from a cluster of a few machines with some 16 GB of RAM each).
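A minimal scroll sketch (index name myindex and the batch size are placeholders): the first search opens a scroll context kept alive for the given time, and each follow-up request pages through it using the _scroll_id returned by the previous response:

```json
POST /myindex/_search?scroll=1m
{
    "size": 1000,
    "query": { "match_all": {} }
}

POST /_search/scroll
{
    "scroll": "1m",
    "scroll_id": "<_scroll_id from the previous response>"
}
```

You repeat the second request until it returns no hits, so only one small batch of documents is ever held on the heap at a time.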

However, if you still want to use normal search queries, you may consider using the from and size parameters to do pagination (limiting the number of documents fetched per query, and spreading the GC work better over time).
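For example, two consecutive pages of 10,000 documents each (page sizes are illustrative; note that from + size must stay within index.max_result_window):

```json
{ "from": 0,     "size": 10000, "query": { "match_all": {} } }
{ "from": 10000, "size": 10000, "query": { "match_all": {} } }
```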

Hope this helps!

