
index Elasticsearch document with existing "id" field

I have documents that I want to index into Elasticsearch with an existing unique "id" field. I get an array of documents from a REST API endpoint (e.g. http://some.url/api/products) in no particular order, and if a document with that _id already exists in Elasticsearch it should be updated and reindexed.

I want to create a new document if no document with that _id exists in Elasticsearch, and update the document if it matches an existing one.

This could be done with:

PUT products/product/un1qu3-1d-b718-105973677e95
{
  "id": "un1qu3-1d-b718-105973677e95",
  "state": "packaged"
}

The basic idea is to use the provided "id" field to create or update a document. Extraction of _id from document fields seems deprecated ( link ). But indexing/reindexing documents with the "id" field can be done manually very easily with the Kibana Dev Tools, with Postman, or with a cURL request. I want to achieve this (re-)indexing of documents that I receive over this API endpoint programmatically.
Is it possible to achieve this with Logstash or a simple cronjob? Does Elasticsearch provide any functionality for this? Or do I need to write some custom backend to achieve this?

I thought of either:

1) index the document into Elasticsearch with the "id" field of my document, or

2) find an Elasticsearch query that first searches for the document with the specific "id" field and then updates the document.

I was unable to find a solution for either approach and have no clue what a good approach would look like.

Can anyone point me in the right direction on how to achieve this, suggest a better approach, or provide a solution?

Any help much appreciated!

Update

I solved the problem with the help of the accepted answer. I used Logstash with the Http_poller input plugin, this article: https://www.elastic.co/blog/new-way-to-ingest-part-1 and this elastic.co question: https://discuss.elastic.co/t/upsert-with-logstash/59116

My Logstash output looks like this at the moment:

output {
  elasticsearch {
    index => "products"
    document_type => "product"
    pipeline => "rename_id"
    document_id => "%{id}"
    doc_as_upsert => true
    action => "update"
  }
}
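For reference, the input side that pairs with this output can be sketched with the Http_poller plugin mentioned above. The polling schedule below is a placeholder; only the URL comes from the question:

```
input {
  http_poller {
    urls => {
      products => "http://some.url/api/products"
    }
    # poll the REST endpoint once a minute; adjust to your needs
    schedule => { cron => "* * * * * UTC" }
    # the json codec emits one event per element when the response is a JSON array
    codec => "json"
  }
}
```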

Update 2

Just for the sake of completeness, I added the "rename_id" pipeline:

{
  "rename_id": {
    "description": "_description",
    "processors": [
      {
        "set": {
          "field": "_id",
          "value": "{{id}}"
        }
      }
    ]
  }
}

It works this way! Thanks a lot!

Peter,

If I understand correctly, you want to ingest your documents into Elasticsearch and will have some updates for these documents in the future?

If that's the case:

- Use your documents' primary key as the id for the Elasticsearch documents.
- You can ingest the entire document with updated values; Elasticsearch will replace the previous document with the new one, given the primary key is the same. The old document with the same id will be deleted.

We use this approach for our search data.
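This "primary key as document id" approach can be sketched without any client library; the helper below just builds the raw request triple, and its name is illustrative, not part of any Elasticsearch client:

```python
def index_request(index, doc_type, doc):
    """Build the (method, path, body) for indexing a document under its own "id".

    PUT /<index>/<type>/<id> creates the document if the id is new,
    and fully replaces the previous document if the id already exists.
    """
    return ("PUT", "/{}/{}/{}".format(index, doc_type, doc["id"]), doc)
```

For example, `index_request("products", "product", {"id": "un1qu3-1d-b718-105973677e95", "state": "packaged"})` yields the same call as the PUT shown in the question.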

You can use ingest pipelines to extract the id from the body, and the _create endpoint to only create a document if it does not already exist. Minor note: if you can specify the id on the client side, indexing will be faster, as adding a pipeline adds a certain overhead.

PUT _ingest/pipeline/my_pipeline
{
  "description": "_description",
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{id}}"
      }
    }
  ]
}

PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}

GET twitter/tweet/123

# this call will fail
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}
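If the id can be set on the client side (the faster route mentioned above), the same create-only semantics are available through the _bulk API with create actions. A minimal sketch in Python; the function name is illustrative and `docs` stands for the array returned by the REST endpoint:

```python
import json

def bulk_create_body(index, doc_type, docs):
    """Build an NDJSON _bulk body with one create action per document,
    using each document's own "id" field as the Elasticsearch _id.
    Documents whose _id already exists are rejected, mirroring op_type=create.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {"_index": index, "_type": doc_type, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```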

You can use a script to UPSERT (update or insert) your document:

POST /products/product/un1qu3-1d-b718-105973677e95/_update
{
   "script": {
      "inline": "ctx._source.state = \"packaged\"",
      "lang": "painless"
   },
   "upsert": {
      "id": "un1qu3-1d-b718-105973677e95",
      "state": "packaged"
   }
}

The above query finds the document with _id = "un1qu3-1d-b718-105973677e95". If it finds the document, it updates its state to "packaged"; otherwise it creates a new document with the fields "id" and "state" (you can insert as many fields as you want).
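The body of such a scripted upsert can be generated per document. A sketch in Python; the helper name is illustrative, and using script `params` instead of a hard-coded value avoids recompiling the script for every new state:

```python
def scripted_upsert_body(doc):
    """Build the _update request body: update "state" if the document exists,
    otherwise index the full document given in "upsert"."""
    return {
        "script": {
            "inline": "ctx._source.state = params.state",
            "lang": "painless",
            "params": {"state": doc["state"]},
        },
        "upsert": doc,
    }
```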
