
Import list of dicts or JSON file to Elasticsearch with Python

I have a .json.gz file that I wish to load into Elasticsearch.

My first attempt involved using the json module to convert the JSON to a list of dicts:

import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch

nodes_f = gzip.open("nodes.json.gz", "rt")  # text mode, so json.load gets str
nodes = json.load(nodes_f)

Dict example:

pprint(nodes[0])

{u'index': 1,
 u'point': [508163.122, 195316.627],
 u'tax': u'fehwj39099'}

Using Elasticsearch:

es = Elasticsearch()

data = es.bulk(index="index", body=nodes)

However, this returns:

elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
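The error arises because the bulk endpoint does not accept a plain list of dicts: it expects newline-delimited JSON in which every document source line is preceded by an action line such as `{"index": {...}}`. A minimal sketch of building such a body by hand (the index name `nodes` and the second sample document are illustrative assumptions):

```python
import json

# Hypothetical sample documents shaped like the ones above
nodes = [
    {"index": 1, "point": [508163.122, 195316.627], "tax": "fehwj39099"},
    {"index": 2, "point": [508170.0, 195320.0], "tax": "abcde12345"},
]

# The _bulk endpoint wants NDJSON: an action line, then the source line
lines = []
for node in nodes:
    lines.append(json.dumps({"index": {"_index": "nodes", "_id": node["index"]}}))
    lines.append(json.dumps(node))
body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
```

Passing `body` to `es.bulk` would satisfy the action/metadata format the error message complains about.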

Beyond this, I wish to be able to find the tax for a given point query, in case this has an impact on how I should be indexing the data with Elasticsearch.

Alfe pointed me in the right direction, but I couldn't get his code to work.

I found two solutions:

Line by line, with a for loop:

from elasticsearch import Elasticsearch

es = Elasticsearch()

for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)

In bulk, using helpers:

from elasticsearch import helpers

actions = [
    {
        "_index": "nodes_bulk",
        "_type": "external",
        "_id": str(node['index']),
        "_source": node,
    }
    for node in nodes
]

helpers.bulk(es, actions)

Bulk was around 22 times faster for a list of 343,724 dicts.
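One further note: `helpers.bulk` accepts any iterable of actions, so for inputs as large as 343,724 dicts, the actions list above can be replaced by a generator to avoid materialising every action dict in memory at once. A sketch (the sample `nodes` here are made up for illustration):

```python
def gen_actions(nodes):
    # Yield one action dict per node instead of building the whole list
    for node in nodes:
        yield {
            "_index": "nodes_bulk",
            "_type": "external",
            "_id": str(node["index"]),
            "_source": node,
        }

# helpers.bulk(es, gen_actions(nodes)) would consume this lazily;
# here we just materialise a small sample to show the shape
nodes = [{"index": i, "point": [0.0, 0.0], "tax": "t%d" % i} for i in range(3)]
actions = list(gen_actions(nodes))
```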

The ES bulk library showed several problems, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:

import json
import requests

headers = {'Content-type': 'application/json',
           'Accept': 'text/plain'}

jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of the dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))
data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)  # url is the index's _bulk endpoint

We needed to set a specific _id, but I guess you can skip this part in case you want a random _id set by ES automatically.
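Following the same string-building approach, letting ES auto-generate the _id just means emitting an empty index action with no "_id" key (a sketch; the sample doc is made up):

```python
import json

docs = [{"price": 10, "productID": "XHDK-A-1293-#fJ3"}]  # no _id field

jsons = []
for d in docs:
    # Empty action object: ES assigns a random _id on indexing
    jsons.append('{"index":{}}\n%s\n' % json.dumps(d))
data = ''.join(jsons)
```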

Hope that helps.

Here is my working code using the bulk API:

Define a list of dicts:

from elasticsearch import Elasticsearch, helpers
es = Elasticsearch([{'host':'localhost', 'port': 9200}])

doc = [{'_id': 1, 'price': 10, 'productID': 'XHDK-A-1293-#fJ3'},
       {'_id': 2, 'price': 20, 'productID': 'KDKE-B-9947-#kL5'},
       {'_id': 3, 'price': 30, 'productID': 'JODL-X-1937-#pV7'},
       {'_id': 4, 'price': 30, 'productID': 'QQPX-R-3956-#aD8'}]

helpers.bulk(es, doc, index='products', doc_type='_doc', request_timeout=200)
