I have a .json.gz
file that I wish to load into elastic search.
My first attempt involved using the json
module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I wish to be able to find the tax
for given point
query, in case this has an impact on how I should be indexing the data with elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = elasticsearch.Elasticsearch()
for node in nodes:
_id = node['index']
es.index(index='nodes',doc_type='external',id=_id,body=node)
In bulk, using helper
:
actions = [
{
"_index" : "nodes_bulk",
"_type" : "external",
"_id" : str(node['index']),
"_source" : node
}
for node in nodes
]
helpers.bulk(es,actions)
Bulk was around 22
times faster for a list of 343724
dicts.
The ES bulk library showed several problems, including performance trouble, not being able to set specific _id
s etc. But since the bulk API of ES is not very complicated, we did it ourselves:
import requests
headers = { 'Content-type': 'application/json',
'Accept': 'text/plain'}
jsons = []
for d in docs:
_id = d.pop('_id') # take _id out of dict
jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))
data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id
but I guess you can skip this part in case you want a random _id
set by ES automatically.
Hope that helps.
Here is my working code using bulk api:
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch([{'host':'localhost', 'port': 9200}])
doc = [{'_id': 1,'price': 10, 'productID' : 'XHDK-A-1293-#fJ3'},
{'_id':2, "price" : 20, "productID" : "KDKE-B-9947-#kL5"},
{'_id':3, "price" : 30, "productID" : "JODL-X-1937-#pV7"},
{'_id':4, "price" : 30, "productID" : "QQPX-R-3956-#aD8"}]
helpers.bulk(es, doc, index='products',doc_type='_doc', request_timeout=200)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.