Periodically process and update documents in an Elasticsearch index
I need to come up with a strategy to process and update documents in an Elasticsearch index periodically and efficiently. I do not have to look at documents that I have processed before.
My setting is that I have a long-running process which continuously inserts documents into an index, say approx. 500 documents per hour (think of the common logging example).
I need to find a solution to update some amount of documents periodically (via a cron job, e.g.) by running some code on a specific field (a text field, e.g.) to enhance each document with a number of new fields. I want to do this to offer more fine-grained aggregations on the index. In the logging analogy this could mean, e.g., that I take the UserAgent string from a log entry (document), do some parsing on it, add some new fields back to that document, and index it.
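For illustration, the enhancement step from the logging analogy could look like the sketch below. The field names (`ua_os`, `ua_is_bot`) and the parsing rules are assumptions for the example, not part of the question:

```python
import re

def parse_user_agent(ua: str) -> dict:
    """Derive a few coarse fields from a UserAgent string (illustrative only)."""
    fields = {"ua_is_bot": bool(re.search(r"bot|crawler|spider", ua, re.I))}
    if "Windows" in ua:
        fields["ua_os"] = "Windows"
    elif "Mac OS X" in ua:
        fields["ua_os"] = "macOS"
    elif "Linux" in ua:
        fields["ua_os"] = "Linux"
    else:
        fields["ua_os"] = "other"
    return fields

# A log document gains new fields that aggregations can later group by.
doc = {"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
doc.update(parse_user_agent(doc["user_agent"]))
```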
So my approach would be: query the documents that have not been processed yet, using must_not and exists, for instance.

I know there is the Update by query API. But this does not seem to be the right tool here, since I need to run my own code (which, btw, depends on external libraries) on my server, and not as a Painless script, which would not support the comprehensive processing I need.
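The "not yet processed" filter can be expressed as a bool query on the absence of the enrichment field. A sketch, where `parsed_ua` is a placeholder name for whatever field the post-processing adds:

```python
# Match only documents that do NOT yet have the enrichment field.
# "parsed_ua" is a hypothetical field name, standing in for the real one.
unprocessed_query = {
    "query": {
        "bool": {
            "must_not": {
                "exists": {"field": "parsed_ua"}
            }
        }
    }
}
```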
I am accessing Elasticsearch via Python.
The problem is now that I don't know how to implement the above approach. E.g., what happens if the number of documents obtained in the first step is larger than myindex.settings.index.max_result_window?
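The max_result_window setting caps only a single deep page of results; the scroll API (which the elasticsearch-py scan helper wraps) fetches hits in batches, so the total result count is not limited by it. A toy generator, with no Elasticsearch involved, illustrates the batching pattern:

```python
def scroll_like(items, batch_size):
    """Yield items batch by batch, the way the scroll API pages results."""
    for start in range(0, len(items), batch_size):
        # each "page" stays small, regardless of the total result size
        yield from items[start:start + batch_size]

# A result set far larger than any single-page cap can still be streamed.
total = sum(1 for _ in scroll_like(range(25_000), 1_000))
```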
Any ideas?
I considered @Jay's comment and ended up with this pattern, for the moment:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import scan
from my_module.postprocessing import post_process_doc

es = Elasticsearch(...)
es.ping()

def update_docs(docs):
    """Turn scanned hits into partial-update actions for the bulk helper."""
    for idx, doc in enumerate(docs):
        if idx % 10000 == 0:
            print('next 10k')
        new_field_value = post_process_doc(doc)
        doc_update = {
            "_index": doc["_index"],
            "_id": doc["_id"],
            "_op_type": "update",
            "doc": { <<the new field>> : new_field_value }
        }
        yield doc_update

docs = scan(es, query='{ "query" : { "bool": { "must_not": { "exists": { "field": <<the new field>> }} } }}', index=index, scroll="1m", preserve_order=True)
bulk(es, update_docs(docs))
Comments: I had to use preserve_order=True, otherwise an error was thrown.
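Because the bulk helper consumes plain action dictionaries, the action-building step above can be unit-tested without a live cluster. A sketch, again using the hypothetical field name `parsed_ua`:

```python
def make_update_action(hit, field, value):
    """Build the partial-update action dict the bulk helper expects."""
    return {
        "_index": hit["_index"],
        "_id": hit["_id"],
        "_op_type": "update",
        "doc": {field: value},
    }

# A fake scan hit is enough to check the action shape offline.
hit = {"_index": "myindex", "_id": "1", "_source": {"user_agent": "..."}}
action = make_update_action(hit, "parsed_ua", {"ua_os": "other"})
```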