
Elasticsearch python API: Delete documents by query

I see that the following API will do delete by query in Elasticsearch: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

But I want to do the same with the Elasticsearch bulk API; I already know I can use bulk to upload docs with

es.bulk(body=json_batch)

I am not sure how to invoke delete-by-query using the Python bulk API for Elasticsearch.

Seeing as how Elasticsearch has deprecated the delete-by-query API, I created this Python script using the bindings to do the same thing. First, define an ES connection:

import elasticsearch
es = elasticsearch.Elasticsearch(['localhost'])

Now you can use that to create a query for the results you want to delete.

search=es.search(
    q='The Query to ES.',
    index="*logstash-*",
    size=10,
    search_type="scan",
    scroll='5m',
)

Now you can scroll that query in a loop, generating the bulk request as we go.

while True:
    try:
        # Get the next page of results.
        scroll = es.scroll(scroll_id=search['_scroll_id'], scroll='5m')
    # When the scroll is exhausted it throws an error; catch it and break the loop.
    except elasticsearch.exceptions.NotFoundError:
        break
    # We have results; initialize the bulk variable.
    bulk = ""
    for result in scroll['hits']['hits']:
        bulk = bulk + '{ "delete" : { "_index" : "' + str(result['_index']) + '", "_type" : "' + str(result['_type']) + '", "_id" : "' + str(result['_id']) + '" } }\n'
    # Finally, do the deleting.
    es.bulk(body=bulk)

To use the bulk API you need to ensure two things:

  1. The document you want to update (here, delete) is identified: (index, type, id).
  2. Each request is terminated with a newline (\n), as illustrated in the sketch below.
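
For illustration, here is a minimal sketch of those two requirements, building the same kind of newline-delimited delete actions with json.dumps; the index, type, and id values are hypothetical, and es is the connection defined above.

import json

# Hypothetical documents to delete -- substitute your own index/type/id values.
docs_to_delete = [
    {"_index": "logstash-2015.01.01", "_type": "logs", "_id": "doc-1"},
    {"_index": "logstash-2015.01.01", "_type": "logs", "_id": "doc-2"},
]

# Requirement 1: each action names the index, type, and id.
# Requirement 2: each action is a one-line JSON object terminated by a newline.
bulk_body = "".join(
    json.dumps({"delete": {"_index": d["_index"], "_type": d["_type"], "_id": d["_id"]}}) + "\n"
    for d in docs_to_delete
)

es.bulk(body=bulk_body)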

The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete-by-query you still need to make two queries: one to fetch the records to be deleted, and another to delete them.

The easiest way to do this in bulk is to use the Python module's scan() helper, which wraps the Elasticsearch Scroll API so you don't have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():

from elasticsearch.helpers import bulk, scan

bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)

Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:

BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):

    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

    i += 1

bulk(es, bulk_deletes)

I'm currently using this script, based on @drs's response, but using the bulk() helper consistently. It can create batches of jobs from an iterator by using the chunk_size parameter (defaults to 500; see streaming_bulk() for more info).

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

BULK_SIZE = 1000

def stream_items(es, query):
    for e in scan(es, 
                  query=query, 
                  index=ES_INDEX,
                  doc_type=ES_DOCTYPE, 
                  scroll='1m',
                  _source=False):

        # There is a parameter to avoid this del statement (`track_source`), but it doesn't exist in my version.
        del e['_score']
        e['_op_type'] = 'delete'
        yield e

es = Elasticsearch(host='localhost')
bulk(es, stream_items(es, query), chunk_size=BULK_SIZE)

Thanks, this was really useful!

I have two suggestions:

  1. When getting the next page of results with scroll, the scroll_id passed to es.scroll() should be the _scroll_id returned by the last scroll call, not the one the original search returned. Elasticsearch does not update the scroll ID every time, especially with smaller requests (see this discussion), so this code might work, but it's not foolproof; see the sketch at the end of this comment.

  2. It's important to clear scrolls, as keeping search contexts open for a long time has a cost (Clear Scroll API - Elasticsearch API documentation). They will close eventually after the timeout, but if you're low on disk space, for example, clearing them can save you a lot of headache.

An easy way is to build a list of scroll IDs as you go (make sure to get rid of duplicates!) and clear everything at the end:

es.clear_scroll(scroll_id=scroll_id_list)
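
For concreteness, here is a minimal sketch of the scroll loop with both suggestions applied; it assumes es, elasticsearch, and search are defined as in the earlier answer, and it elides the bulk-request building shown there.

scroll_id = search['_scroll_id']
scroll_id_list = [scroll_id]

while True:
    try:
        page = es.scroll(scroll_id=scroll_id, scroll='5m')
    except elasticsearch.exceptions.NotFoundError:
        break
    # Stop once there are no more hits to process.
    if not page['hits']['hits']:
        break
    # Suggestion 1: always pass the _scroll_id returned by the latest scroll call.
    scroll_id = page['_scroll_id']
    if scroll_id not in scroll_id_list:
        scroll_id_list.append(scroll_id)
    # ... build and send the bulk delete request for page['hits']['hits'] here ...

# Suggestion 2: clear the search contexts when done instead of waiting for the timeout.
es.clear_scroll(scroll_id=scroll_id_list)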

While operationally equivalent to many other answers, I find this way more accessible:

import elasticsearch
from elasticsearch.helpers import bulk

es = elasticsearch.Elasticsearch(['localhost'])

ids = [1,2,3, ...]      # list of ids that will be deleted
index = "fake_name"     # index where the documents are indexed

actions = ({
    "_id": id,
    "_op_type": "delete"
} for id in ids)

bulk(client=es, actions=actions, index=index, refresh=True)
# `refresh=True` makes the result immediately available
