
Bulk update for Elasticsearch documents using Python

I have Elasticsearch documents like the ones below, where I need to correct the age value based on creationtime and currentdate.

age = currentdate - creationtime

Sample documents:

hits = [
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2018-05-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2013-07-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2014-08-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2015-09-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   }
]

I want to do a bulk update keyed on each document's _id. The problem is that I need to correct 6 months of data, and the data size per day (the doc count of the index) is almost 535329. I want to efficiently bulk-update age by _id, for each day, across all documents, using Python.

Is there a way to do this without looping? All the examples I came across that use Pandas dataframes for updates are based on a known value, but here I only get the _id values when the code runs.

The logic I had written was to fetch all docs, store their _id values, and then update the age for each _id. But that is not an efficient way to update all documents in bulk for each day of the 6 months.

Can anyone give me some ideas for this or point me in the right direction?

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!

A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:

POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      // Parse both timestamps; the two fields use different formats.
      def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));

      // Whole months elapsed, written back to the age field.
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months != 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}
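Before running this against real data, it can help to reproduce the month arithmetic locally and compare it with what the script writes. The following is a minimal Python sketch, assuming ChronoUnit.MONTHS.between counts whole months the way java.time describes for LocalDateTime (an incomplete final day or month is not counted). The names months_between and format_age are hypothetical helpers, not part of any library.

```python
from datetime import datetime, timedelta


def months_between(start: datetime, end: datetime) -> int:
    """Whole months from start to end, truncated toward zero."""
    end_date = end.date()
    # An incomplete final day does not count, in either direction.
    if end_date > start.date() and end.time() < start.time():
        end_date -= timedelta(days=1)
    elif end_date < start.date() and end.time() > start.time():
        end_date += timedelta(days=1)

    total = (end_date.year - start.year) * 12 + (end_date.month - start.month)
    # Drop the last month if its day-of-month has not been reached yet.
    if total > 0 and end_date.day < start.day:
        total -= 1
    elif total < 0 and end_date.day > start.day:
        total += 1
    return total


def format_age(creationtime: str, currentdate: str) -> str:
    """Parse the two formats used in the sample documents and build the age string."""
    created = datetime.strptime(creationtime, "%Y-%m-%dT%H:%M:%S")
    current = datetime.strptime(currentdate, "%Y-%m-%d %H:%M:%S")
    months = months_between(created, current)
    # Pluralise unless the count is exactly one.
    return f"{months} month" + ("" if months == 1 else "s")


# First sample document from the question.
print(format_age("2018-05-20T20:57:02", "2021-02-05 00:00:00"))  # prints "32 months"
```

If a handful of updated documents disagree with this local calculation, that is a cue to inspect the date formats before touching the rest of the index.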

The official Python client has this method too, as update_by_query.
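With the Python client, the same request might look roughly like the sketch below. It is not a tested deployment: it assumes the elasticsearch package is installed, that the client accepts the request as body= (newer client versions may prefer passing query= and script= as keyword arguments), and that the field to correct is age. The function names, the index name, and the localhost URL are placeholders.

```python
# Painless script: recompute the age from the two stored timestamps.
AGE_SCRIPT = """
def created = LocalDateTime.parse(ctx._source.creationtime,
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def current = LocalDateTime.parse(ctx._source.currentdate,
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, current);
ctx._source.age = months + ' month' + (months != 1 ? 's' : '');
"""


def build_update_body(query=None):
    """Return the _update_by_query body; match_all unless a query is given."""
    return {
        "query": query or {"match_all": {}},
        "script": {"source": AGE_SCRIPT, "lang": "painless"},
    }


def fix_ages(client, index, query=None):
    """Run the update server-side; no documents or ids are fetched by Python."""
    return client.update_by_query(index=index, body=build_update_body(query))


# Usage (needs a reachable cluster and `pip install elasticsearch`):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")  # assumed local cluster
#   print(fix_ages(es, "your-index-name"))
```

Keeping the body builder separate from the client call makes it easy to print and inspect the request before anything is sent.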

Try running this update script on a small subset of your documents before letting it loose on your whole index, by using a query other than the match_all I put there.
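One way to pick such a subset is an ids query, which limits the update to documents you name explicitly. A small sketch in Python; subset_query is a hypothetical helper, and the _id below is the one shown in the sample hits in the question.

```python
def subset_query(doc_ids):
    """Build a query clause matching only the given document _ids."""
    return {"ids": {"values": list(doc_ids)}}


# Swap this in for the match_all clause of the _update_by_query request.
trial_query = subset_query(["CrRvuvcC_uqfwo-WSwLi"])
print(trial_query)
```

Once the trial documents look right, the request can go back to match_all.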


It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.

You see, if your index mapping's dates are properly defined like so:

{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}

the age can be calculated as a script field:

POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
                          doc['creationtime'].value,
                          doc['currentdate'].value );
          return months + ' month' + (months != 1 ? 's' : '');
        """
      }
    }
  }
}

The only caveat is that the value won't be inside _source but rather inside its own group called fields (which implies that several script fields can be returned at once).

"hits" : [
  {
    ...
    "_id" : "FFfPuncBly0XYOUcdIs5",
    "fields" : {
      "age_calculated" : [ "32 months" ]   <--
    }
  },
  ...
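On the Python side, reading the computed value means looking under fields instead of _source. A minimal sketch, assuming the search response is already available as a plain dict shaped like the one above; ages_by_id is a hypothetical helper name.

```python
def ages_by_id(response, field="age_calculated"):
    """Map each hit's _id to the first value of the given script field."""
    ages = {}
    for hit in response["hits"]["hits"]:
        values = hit.get("fields", {}).get(field, [])
        # Script field values arrive wrapped in a list, even when single.
        if values:
            ages[hit["_id"]] = values[0]
    return ages


sample = {
    "hits": {
        "hits": [
            {
                "_id": "FFfPuncBly0XYOUcdIs5",
                "fields": {"age_calculated": ["32 months"]},
            }
        ]
    }
}
print(ages_by_id(sample))  # {'FFfPuncBly0XYOUcdIs5': '32 months'}
```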
