
Periodically process and update documents in elasticsearch index

I need to come up with a strategy to process and update documents in an elasticsearch index periodically and efficiently. I do not have to look at documents that I have processed before.

My setting is that I have a long-running process which continuously inserts documents into an index, at roughly 500 documents per hour (think of the common logging example).

I need to find a solution that periodically (e.g. via a cron job) updates some number of documents, running some code on a specific field (e.g. a text field) to enhance each document with a number of new fields. I want to do this to offer more fine-grained aggregations on the index. In the logging analogy this could be: I take the UserAgent string from a log entry (document), parse it, add some new fields back to that document, and index it again.
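For illustration, the parsing step could look something like the sketch below, assuming the third-party user-agents package; the function name and field names are placeholders:

from user_agents import parse  # external library: pip install user-agents

def enrich_user_agent(ua_string):
    """Derive a few new fields from a raw UserAgent string (hypothetical example)."""
    ua = parse(ua_string)
    return {
        "ua_browser": ua.browser.family,
        "ua_os": ua.os.family,
        "ua_is_bot": ua.is_bot,
    }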

So my approach would be:

  1. Get some number of documents (or even all) that I haven't looked at before. I could query them by combining must_not and exists, for instance (see the sketch after this list).
  2. Run my code on these documents (run the parser, compute some new stuff, whatever).
  3. Update the documents obtained previously (preferably via the bulk api).
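Such a query might look like this, with NEW_FIELD as a placeholder for the enrichment field:

# Hypothetical name of the field added by the enrichment step.
NEW_FIELD = "ua_browser"

# Select only documents that do not yet carry the enrichment field.
query = {
    "query": {
        "bool": {
            "must_not": {
                "exists": {"field": NEW_FIELD}
            }
        }
    }
}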

I know there is the Update by query API. But it does not seem right here, since I need to run my own code (which, by the way, depends on external libraries) on my server and not as a Painless script, which would not support the comprehensive tasks I need.
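For reference, a rough sketch of what an Update by query call looks like from Python; the script runs inside the cluster, so it cannot call external libraries, which is exactly the limitation (index and field names are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a locally reachable cluster

es.update_by_query(
    index="my-logs",
    body={
        "query": {"bool": {"must_not": {"exists": {"field": "ua_browser"}}}},
        "script": {
            "lang": "painless",
            # Only what Painless ships with is available here -- no external parsers.
            "source": "ctx._source.ua_browser = 'unknown'",
        },
    },
)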

I am accessing elasticsearch via python.

The problem now is that I don't know how to implement the above approach. E.g. what if the number of documents obtained in step 1 is larger than myindex.settings.index.max_result_window?
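One relevant detail: max_result_window (10,000 by default) only caps from/size pagination. A scroll keeps a cursor open server-side and streams hits in batches, so it is not subject to that limit; elasticsearch-py wraps it in the scan helper. A minimal sketch, reusing the query from above:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()  # assumes a locally reachable cluster

# scan() opens a scroll context and fetches `size` hits per round trip,
# so the from/size cap never applies. `query` is the must_not/exists
# query sketched above.
for hit in scan(es, index="my-logs", query=query, size=1000, scroll="5m"):
    pass  # process each hit here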

Any ideas?

I considered @Jay's comment and ended up with this pattern, for the moment:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk, scan

from my_module.postprocessing import post_process_doc

NEW_FIELD = "ua_browser"  # placeholder: the enrichment field to add
index = "my-logs"         # placeholder: the index to process

es = Elasticsearch(...)
es.ping()

def update_docs(docs):
    """Yield bulk update actions that add the enrichment field to each doc."""
    for idx, doc in enumerate(docs):
        if idx % 10000 == 0:
            print('next 10k')

        new_field_value = post_process_doc(doc)

        yield {
            "_index": doc["_index"],
            "_id": doc["_id"],
            "_op_type": "update",
            "doc": {NEW_FIELD: new_field_value},
        }

# Select documents that do not yet have the enrichment field.
query = {"query": {"bool": {"must_not": {"exists": {"field": NEW_FIELD}}}}}

docs = scan(es, query=query, index=index, scroll="1m", preserve_order=True)

bulk(es, update_docs(docs))
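One note on this pattern: preserve_order=True forces the scroll to run as a sorted query, which is considerably more expensive, and the update order does not matter here, so it can usually be dropped. The bulk helper also sends 500 actions per request by default; its chunk_size argument can be tuned, e.g.:

# Unordered scroll is cheaper; the order of the updates does not matter.
docs = scan(es, query=query, index=index, scroll="5m")

# Send 1000 update actions per bulk request instead of the default 500.
bulk(es, update_docs(docs), chunk_size=1000, request_timeout=120)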

