Any efficient way to get unique terms from Elasticsearch index
My aim is to store all unique terms, along with their md5 hashes, in a database. I have an index of 1 million documents containing ~400,000 unique terms. I got this figure using aggregations in Elasticsearch:
GET /dt_index/document/_search
{
  "aggregations": {
    "my_agg": {
      "cardinality": {
        "field": "text"
      }
    }
  }
}
I can get the unique terms using the following:
GET /dt_matrix/document/_search
{
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text",
        "size": 100
      }
    }
  }
}
This gives me 10 search results along with a terms aggregation of 100 unique terms. But fetching a single JSON response with ~400,000 terms would require a lot of memory. Just as we can use scan-scroll to page through all the search results, is there any way I can iterate over all the unique terms without loading them all into memory at once?
You can't scan-scroll through aggregation results. Instead, you should index these unique terms into a separate index or type at indexing time, and then do normal pagination over that index.
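As a sketch of that approach, suppose each unique term is written into a hypothetical `dt_terms` index as its own document (e.g. `{"term": "...", "md5": "..."}`, with `term` stored not_analyzed so it can be sorted on). Ordinary from/size pagination, or scan-scroll, then walks the full set:

```
GET /dt_terms/term/_search
{
  "from": 0,
  "size": 1000,
  "sort": [
    { "term": "asc" }
  ]
}
```

Each subsequent page just advances `from` by the page size; because these are real documents rather than aggregation buckets, scan-scroll also works here if the set grows large.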
Although you can't scroll through aggregations, you can retrieve smaller, more memory-manageable subsets by adding restrictions to your query request. For example, you can request all unique terms starting with the letter A, then B, and so on. Adjust your query until you are satisfied with the size of the biggest subset.
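One way to express this per-letter split is the `include` option of the terms aggregation, which filters buckets by a regular expression on the term value. A sketch for the "starts with a" subset (the `size` of 50000 is a guess at a comfortable upper bound, and `"size": 0` at the top level just suppresses the search hits):

```
GET /dt_matrix/document/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text",
        "include": "a.*",
        "size": 50000
      }
    }
  }
}
```

Repeating the request with `"b.*"`, `"c.*"`, and so on covers the whole term set one manageable slice at a time.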