How do I compute and add meta data to an existing Elasticsearch index?

I loaded over 38 million documents (text strings) to an Elasticsearch index on my local machine. I would like to compute the length of each string and add that value as meta data in the index.

Should I have computed the string lengths as meta data before loading the documents to Elasticsearch? Or, can I update the meta data with a computed value after the fact?

I'm relatively new to Elasticsearch/Kibana and these questions arose because of the following Python experiments:

  1. Data as a list of strings

     mylist = ['string_1', 'string_2',..., 'string_N']
     L = [len(s) for s in mylist]  # this computation takes about 1 minute on my machine

    The downside of option 1 is that I'm not leveraging Elasticsearch and 'mylist' is occupying a large chunk of memory.

  2. Data as an Elasticsearch index where each string in 'mylist' was loaded into the field 'text'.

     from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

     document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='myindex')
     docs = document_store.get_all_documents_generator()
     L = [len(d.text) for d in docs]  # this computation takes about 6 minutes on my machine

    The downside of option 2 is that it took much longer to compute. The upside is that the generator freed up memory. The long computation time is why I thought storing the string length (and other analytics) as meta data in Elasticsearch would be a good solution.

Are there other options I should consider? What am I missing?

If you want to store the size of the whole document, I suggest installing the mapper-size plugin, which will store the size of the source document in the _size field.
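If you go that route from Python, a minimal sketch of enabling the _size field could look like the one below. It assumes the mapper-size plugin is installed on your node, the elasticsearch Python client is available, and the index is created fresh; the index name is illustrative:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])  # assumes a local, unsecured cluster

# With the mapper-size plugin installed, the _size metadata field is enabled
# per index in the mapping
es.indices.create(
    index='myindex',
    body={'mappings': {'_size': {'enabled': True}}},
)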

If you only want to store the size of a specific field of your source document, then you need to proceed differently.

What I suggest is to create an ingest pipeline that will process each document just before it gets indexed. That ingest pipeline can then be used either when indexing the documents the first time or after having loaded the documents. I'll show you how.

First, create the ingest pipeline with a script processor that will store the size of the string in the text field in another field called textLength.

PUT _ingest/pipeline/string-length
{
  "description": "My optional pipeline description",
  "processors": [
    {
      "script": {
        "source": "ctx.textLength = ctx.text.length()"
      }
    }
  ]
}
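If you prefer to do this from Python rather than the Kibana console, here is a minimal sketch of creating the same pipeline with the official elasticsearch client (assuming elasticsearch-py is installed and the cluster runs unsecured on localhost):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])  # assumes a local, unsecured cluster

# Create (or overwrite) the ingest pipeline defined in the console request above
es.ingest.put_pipeline(
    id='string-length',
    body={
        'description': 'My optional pipeline description',
        'processors': [
            {'script': {'source': 'ctx.textLength = ctx.text.length()'}}
        ]
    }
)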

So, if you've already loaded the documents into Elasticsearch and would like to enrich each document with the length of one of its fields, you can do it after the fact by using the Update by Query API, like this:

POST myindex/_update_by_query?pipeline=string-length&wait_for_completion=false
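The equivalent call from Python could look like the sketch below (same client assumptions as above). With wait_for_completion=False, Elasticsearch runs the enrichment as a background task and returns a task id you can poll:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])

# Enrich every existing document in myindex through the pipeline, in the background
resp = es.update_by_query(
    index='myindex',
    pipeline='string-length',
    wait_for_completion=False,
)
print(resp)  # contains a task id you can check with the Tasks API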

It is also possible to leverage that ingest pipeline when the documents get indexed for the first time, simply by referencing the pipeline in your indexing request, like this:

PUT myindex/_doc/123?pipeline=string-length
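From Python, an indexing call through the pipeline might look like this sketch (document id and body are illustrative; same client assumptions as above):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])

# Index a new document through the pipeline; textLength is filled in automatically
es.index(
    index='myindex',
    id=123,
    pipeline='string-length',
    body={'text': 'some example string'},
)

# Fetch it back to confirm the computed field is present
doc = es.get(index='myindex', id=123)
print(doc['_source']['textLength'])  # -> 19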

Both options will work; try them out and pick the one that best suits your needs.
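As a side note, once textLength is stored you shouldn't need to iterate over all 38 million documents in Python to analyze the lengths. A sketch like the one below (same client assumptions as above, field name as defined in the pipeline) pushes that work to Elasticsearch with a stats aggregation:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=['localhost'])

# Compute min/max/avg/sum of the string lengths server-side, without fetching documents
resp = es.search(
    index='myindex',
    body={'size': 0, 'aggs': {'len_stats': {'stats': {'field': 'textLength'}}}},
)
print(resp['aggregations']['len_stats'])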
