
How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that keeping all of it under one index would be rough to query. The row data is nothing more than 1-3 properties at most.
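
For context, the bulk loading I have in mind is roughly the following (a minimal sketch using the official Python client, 8.x style; the file format, index name, and field names are placeholders for my actual data):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

    def generate_actions(path, index_name):
        """Yield one bulk action per row of a flat file.

        Assumes tab-separated rows with up to three properties;
        the parsing would be adjusted to the real file format."""
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                doc = {f"prop{i + 1}": value for i, value in enumerate(fields)}
                yield {"_index": index_name, "_source": doc}

    # helpers.bulk batches the actions into _bulk API requests
    helpers.bulk(es, generate_actions("entries.tsv", "entries"))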

How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster, and they will work together to crunch the set?

I have been walking through the documentation, and it explains what to do, but not necessarily why it does what it does.

In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?

That is exactly what you need to do. Typically it's an iterative process:

  1. Start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
  2. Put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
  3. See if the queries are particularly slow and if their results are relevant enough. You can change the index mappings or the queries you're using to achieve faster results, and indeed add more nodes to your cluster (see the sketch after this list).
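
The reason adding nodes helps is that an index is split into shards, and each shard can be hosted on a different node, so every search fans out across machines in parallel. A minimal sketch of setting this up with the official Python client (8.x style; the index name, shard count, and field name are just examples):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Shard count is fixed at index creation time, so size it for the
    # cluster you expect to grow into; replicas add read throughput.
    es.indices.create(
        index="entries",
        settings={"number_of_shards": 6, "number_of_replicas": 1},
    )

    # A search fans out to all shards in parallel and merges the results.
    resp = es.search(index="entries", query={"match": {"prop1": "example"}})
    print(resp["took"], "ms,", resp["hits"]["total"]["value"], "hits")

With six shards, a six-node cluster can put one shard on each node, and each node only has to crunch roughly a sixth of the 220 million entries per query.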

Since you mention Logstash, there are a few things that may help further:

  • Check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
  • If it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
  • Try using the keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside it and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with values like ["draft", "review", "published"]. There's a mapping sketch after this list.
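
Putting the last two points together, here's a rough sketch of creating a per-day index with an explicit mapping (again the Python client; the index prefix and field names are just illustrative):

    from datetime import date
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One index per day, e.g. index-2019-08-11; queries over recent data
    # then only have to touch the indices for the days involved.
    index_name = f"index-{date.today():%Y-%m-%d}"

    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                # keyword skips analysis: exact matches only, cheap to
                # filter and aggregate on -- good for tag/status fields.
                "tags": {"type": "keyword"},
                "status": {"type": "keyword"},
                # text is analyzed, enabling full-text search.
                "message": {"type": "text"},
            }
        },
    )

    # Exact-match filter on a keyword field:
    resp = es.search(index=index_name, query={"term": {"status": "published"}})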

Good luck!
