Bulk update Elasticsearch documents using Python

Question

I have Elasticsearch documents like the ones below, and I need to correct the age value based on creationtime and currentdate, where

age = currentdate - creationtime

Sample documents:

hits = [
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2018-05-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2013-07-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2014-08-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   },
   {
      "_id":"CrRvuvcC_uqfwo-WSwLi",
      "creationtime":"2015-09-20T20:57:02",
      "currentdate":"2021-02-05 00:00:00",
      "age":"60 months"
   }
]

I want to do a bulk update based on each document ID. The problem is that I need to correct 6 months of data, and the data size per day (the doc count of the index) is almost 535329. I want to efficiently bulk-update age by _id, for each day, on all documents, using Python.

Is there a way to do this without looping through every document? All the examples I came across that use Pandas dataframes for updates are based on a known value, but here I only get the _id when the code runs.

The logic I had written fetches all docs, stores their _id, and then updates the age for each _id. But that is not an efficient way to bulk-update all documents for each day of 6 months.

Can anyone give me some ideas for this or point me in the right direction?

Answer 1

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!

A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:

POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      def created =  LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));

      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}
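
To sanity-check the script's output on a few documents, the month calculation can be reproduced in plain Python. This is a minimal sketch: months_between and age_label are hypothetical helpers that, as far as I understand it, mirror how ChronoUnit.MONTHS.between truncates to whole months for LocalDateTime values. The format strings are taken from the sample documents in the question.

```python
from datetime import datetime, timedelta

CREATION_FMT = "%Y-%m-%dT%H:%M:%S"   # e.g. 2018-05-20T20:57:02
CURRENT_FMT = "%Y-%m-%d %H:%M:%S"    # e.g. 2021-02-05 00:00:00


def months_between(start, end):
    """Whole months from start to end, truncated toward zero (hypothetical helper)."""
    end_date = end.date()
    # If the end time of day has not reached the start time of day,
    # the final day is incomplete and must not be counted.
    if end_date > start.date() and end.time() < start.time():
        end_date -= timedelta(days=1)
    elif end_date < start.date() and end.time() > start.time():
        end_date += timedelta(days=1)
    # Pack (year, month, day) so one subtraction covers both the month
    # difference and the day-of-month remainder.
    packed_start = (start.year * 12 + start.month) * 32 + start.day
    packed_end = (end_date.year * 12 + end_date.month) * 32 + end_date.day
    diff = packed_end - packed_start
    return diff // 32 if diff >= 0 else -((-diff) // 32)


def age_label(creationtime, currentdate):
    """Build the same 'N month(s)' string as the Painless script."""
    months = months_between(
        datetime.strptime(creationtime, CREATION_FMT),
        datetime.strptime(currentdate, CURRENT_FMT),
    )
    return f"{months} month" + ("s" if months > 1 else "")


print(age_label("2018-05-20T20:57:02", "2021-02-05 00:00:00"))  # 32 months
```

Comparing a few of these values against the updated documents is a cheap way to confirm the script did what you expected.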

The official Python client has this method too. Here's a working example.
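
A call through the Python client might look roughly like this. It is a sketch, not the linked example: it assumes a 7.x-style elasticsearch-py client whose update_by_query accepts index and body arguments (newer client versions prefer separate keyword arguments), the index name and cluster address are placeholders, and build_update_body and update_ages are names made up for illustration.

```python
# Painless script that recomputes the age; it writes the 'age' field
# named in the question.
PAINLESS_SOURCE = """
def created = LocalDateTime.parse(ctx._source.creationtime,
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def current = LocalDateTime.parse(ctx._source.currentdate,
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, current);
ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
"""


def build_update_body(query=None):
    """Request body for _update_by_query; defaults to match_all."""
    return {
        "query": query if query is not None else {"match_all": {}},
        "script": {"source": PAINLESS_SOURCE, "lang": "painless"},
    }


def update_ages(client, index, query=None):
    """Run the update server-side; `client` is an Elasticsearch client instance."""
    return client.update_by_query(index=index, body=build_update_body(query))


# Usage (needs the `elasticsearch` package and a reachable cluster):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")  # assumed address
#   print(update_ages(es, "your-index-name"))
```

Because the script runs inside Elasticsearch, no documents or IDs travel to the Python process; the client only sends the request and receives a summary.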

Try running this update script on a small subset of your documents before letting it loose on your whole index, by using a query other than the match_all I put there.
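
One way to build such a trial query is to target a handful of known document IDs with an ids query. The sketch below only constructs the request body; trial_query is a hypothetical helper, and the ID is the one shown in the question's sample documents.

```python
def trial_query(doc_ids):
    """Query clause matching only the given document IDs (hypothetical helper)."""
    ids = list(doc_ids)
    if not ids:
        raise ValueError("pass at least one document ID for the trial run")
    return {"ids": {"values": ids}}


# Query section to send in place of match_all; the script section of the
# request stays unchanged.
trial_body = {"query": trial_query(["CrRvuvcC_uqfwo-WSwLi"])}
print(trial_body)
```

Once the trial documents look right, the query can be switched back to match_all for the full run.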

It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index, because it can be calculated at query time.

You see, if your index mapping's dates are properly defined like so:

{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}

the age can be calculated as a script field:

POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
                          doc['creationtime'].value,
                          doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}

The only caveat is that the value won't be inside _source but rather in its own group called fields (which implies that several script fields can be returned at once).

"hits" : [
  {
    ...
    "_id" : "FFfPuncBly0XYOUcdIs5",
    "fields" : {
      "age_calculated" : [ "32 months" ]   <--
    }
  },
  ...
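
When reading such a response in Python, the computed value has to be taken from fields rather than _source. A minimal sketch, assuming the response is the parsed JSON dict of the search above; extract_ages is a made-up helper name, and the sample response is trimmed to the relevant keys.

```python
def extract_ages(response, field="age_calculated"):
    """Map each hit's _id to the first value of the given script field."""
    ages = {}
    for hit in response.get("hits", {}).get("hits", []):
        values = hit.get("fields", {}).get(field, [])
        if values:  # script field values are returned as lists
            ages[hit["_id"]] = values[0]
    return ages


sample_response = {
    "hits": {
        "hits": [
            {"_id": "FFfPuncBly0XYOUcdIs5",
             "fields": {"age_calculated": ["32 months"]}},
        ]
    }
}
print(extract_ages(sample_response))  # {'FFfPuncBly0XYOUcdIs5': '32 months'}
```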
