
Elasticsearch JSON Bulk Indexing using Python

I have a huge amount of data in a single JSON file that I want to get into Elasticsearch to do some visualizations in Kibana. My JSON currently looks somewhat like this:

[{"field1": "x", "field2": "y"},
{"field1": "w", "field2": "z"}]
...etc

After doing some research, I found that the best way to feed this data to Elasticsearch is using the Bulk API, but first I need to reformat my data to look like this:

{"index": {"_index": "myindex", "_type": "entity_type", "_id": 1}}
{"field1": "x", "field2": "y"}
{"index": {"_index": "myindex", "_type": "entity_type", "_id": 2}}
{"field1": "w", "field2": "z"}
...etc

And then I have to post this file using curl.

All of this is part of a bigger Python project, so I would like to know the best way to do the reformatting of my data and how to get it into Elasticsearch using Python. I've thought of using regular expressions for the reformatting (re.sub and replace), and I've also looked at the elasticsearch bulk helper to post the data, but I couldn't figure out a solution.

Any help is highly appreciated, thanks.

Hi!

According to https://elasticsearch-py.readthedocs.io/en/master/helpers.html#example , the Python library has a couple of helpers for bulk operations.

For example, for your case you could use the following code:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()  # assumes a node running on localhost:9200

def gendata():
    docs = [{"field1": "x", "field2": "y"}, {"field1": "w", "field2": "z"}]
    for doc in docs:
        yield {
            "_op_type": "index",
            "_index": "docs",
            "_type": "_doc",
            "_source": doc,  # the document body itself
        }

helpers.bulk(es, gendata())

Your current format is fine, provided that you can load the list of dicts into memory.

However, if you cannot load the entire file into memory, then you may need to transform your file into newline-delimited JSON:

{"field1": "x", "field2": "y"}
{"field1": "w", "field2": "z"}

and then read it line by line, using a generator as @banuj suggested.
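Both steps above can be sketched together. This is a minimal sketch, not the exact code from the linked example: the file names `data.json` and `data.ndjson`, the index name `docs`, and a cluster on localhost are all assumptions, and the one-time conversion still loads the whole array with `json.load` (a file too large even for that would need a streaming parser such as the third-party ijson package instead).

```python
import json

def convert_to_ndjson(src_path, dest_path):
    # One-time conversion: JSON array file -> newline-delimited JSON.
    # json.load reads the whole array at once, so this step still needs
    # enough memory; for files too big for that, use a streaming parser
    # (e.g. the third-party ijson package) instead.
    with open(src_path) as src, open(dest_path, "w") as dest:
        for doc in json.load(src):
            dest.write(json.dumps(doc) + "\n")

def gendata(path, index="docs"):
    # Yield one bulk action per NDJSON line, so the indexing step never
    # holds more than one document in memory at a time.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield {
                    "_op_type": "index",
                    "_index": index,
                    "_type": "_doc",
                    "_source": json.loads(line),
                }

# With a live cluster you would then run:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(), gendata("data.ndjson"))
```

Because `gendata` is a generator, `helpers.bulk` chunks the actions and streams them to the cluster without ever materializing the full file.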

Another nice example can be found here: https://github.com/elastic/elasticsearch-py/blob/master/example/load.py#L76-L130
