ElasticSearch bulk update: organizing JSON using python script

I am trying to index a public Amazon data set (products) into my ElasticSearch instance.

I have a very large JSON data file (9.9 GB). I have split it into several smaller files (for memory's sake), and each file now has the following structure:

{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast"}
{"asin": "0000143561", "categories": [["Movies & TV", "Movies"]], "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "salesRank": {"Movies & TV": 376041}, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}}
{"asin": "0000037214", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "salesRank": {"Clothing": 1233557}, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]}

Products are JSON objects, one per line.

Now I want to use the ElasticSearch _bulk API to index all this data.

Since each document in a bulk request needs a header line (correct?), I have written a Python script that creates a new file in the appropriate format.

The shell script that wraps it looks like this:

#!/bin/sh

# 0. Some constants to re-define to match your environment
ES_HOST=localhost:9200
JSON_FILE_IN=/home/aksarora/amazon-sample/parts/newaa
JSON_FILE_OUT=/home/aksarora/amazon-sample/parts_parsed/newaa.json

# 1. Python code to transform your JSON file
PYTHON="import json,sys;
out = open('$JSON_FILE_OUT', 'w');
with open('$JSON_FILE_IN', 'r') as json_in:
    docs = [json.loads(line) for line in json_in]
    for doc in docs:
        out.write('%s\n' % json.dumps({\"index\": {}}));
        out.write('%s\n' % json.dumps(doc, indent=0).replace('\n', ''));
"

# 2. run the Python script from step 1
python3 -c "$PYTHON"

# 3. use the output file from step 2 in the curl command
curl -s -XPOST $ES_HOST/amazon/products/_bulk --data-binary @$JSON_FILE_OUT 

But when I run this, I get the following error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "<string>", line 4, in <listcomp>
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 42 (char 41)
{"error":{"root_cause":[{"type":"parse_exception","reason":"Failed to derive xcontent"}],"type":"parse_exception","reason":"Failed to derive xcontent"},"status":400}

Any idea what I am doing wrong? Thanks.
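Because the script calls json.loads on each line separately, the "line 1 column 42" in the traceback is a position inside whichever line failed, not a line number in the input file. One way to narrow this down is to parse the file line by line and report which lines fail. The following is a minimal sketch, not a diagnosis of this particular file: the helper name find_bad_lines, the limit parameter, and the sample data are all made up for illustration.

```python
import json


def find_bad_lines(lines, limit=10):
    """Collect (file_line, column, message, context) for lines that are not valid JSON.

    `lines` can be an open file or any iterable of strings.
    `limit` (hypothetical parameter) stops the scan after that many failures.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            # Keep a little text around the failure position for inspection.
            start = max(exc.pos - 20, 0)
            context = line[start:exc.pos + 20]
            problems.append((line_number, exc.colno, exc.msg, context))
            if len(problems) >= limit:
                break
    return problems


# Made-up sample: the second line is missing a comma between two members.
sample = [
    '{"asin": "0001048791", "title": "ok"}\n',
    '{"asin": "0000143561", "title": "broken" "price": 12.99}\n',
]
problems = find_bad_lines(sample)
for problem in problems:
    print(problem)
```

For a real file, the same function could be called as find_bad_lines(open(path)) with path pointing at one of the split parts; the reported context then shows the raw text around the position the decoder rejected.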

I don't get the same results when I run the following, which tries to reproduce the problem:

import json
import sys

json_in = (
    """{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast"}""",
    """{"asin": "0000143561", "categories": [["Movies & TV", "Movies"]], "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "salesRank": {"Movies & TV": 376041}, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}}""",
    """{"asin": "0000037214", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "salesRank": {"Clothing": 1233557}, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]}""",
)

out = sys.stdout  # send output to screen
docs = [json.loads(line) for line in json_in]  # assume one object per line
for doc in docs:
    out.write('%s\n' % json.dumps({"index": {}}))
    out.write('%s\n' % json.dumps(doc))

Output:

{"index": {}}
{"asin": "0001048791", "salesRank": {"Books": 6334800}, "imUrl": "http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg", "categories": [["Books"]], "title": "The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast"}
{"index": {}}
{"asin": "0000143561", "description": "3Pack DVD set - Italian Classics, Parties and Holidays.", "title": "Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays", "price": 12.99, "imUrl": "http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif", "related": {"also_viewed": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC", "B002I5GNVU", "B000RBU4BM"], "buy_after_viewing": ["B0036FO6SI", "B000KL8ODE", "000014357X", "B0037718RC"]}, "salesRank": {"Movies & TV": 376041}, "categories": [["Movies & TV", "Movies"]]}
{"index": {}}
{"asin": "0000037214", "title": "Purple Sequin Tiny Dancer Tutu Ballet Dance Fairy Princess Costume Accessory", "price": 6.99, "imUrl": "http://ecx.images-amazon.com/images/I/31mCncNuAZL.jpg", "related": {"also_viewed": ["B00JO8II76", "B00DGN4R1Q", "B00E1YRI4C"]}, "salesRank": {"Clothing": 1233557}, "brand": "Big Dreams", "categories": [["Clothing, Shoes & Jewelry", "Girls"], ["Clothing, Shoes & Jewelry", "Novelty, Costumes & More", "Costumes & Accessories", "More Accessories", "Kids & Baby"]]}
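Both versions above build a list of every document before writing anything, which may matter for input files of this size. As a rough sketch of an alternative, the generators below convert one line at a time and group the result into request bodies of a bounded size. The names bulk_lines and bulk_bodies and the docs_per_request value are assumptions made for illustration, and sending each body to the _bulk endpoint is deliberately left out.

```python
import json


def bulk_lines(docs_in):
    """Yield an action line and a source line for each input document.

    `docs_in` is any iterable of strings holding one JSON object per line.
    """
    for line in docs_in:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        yield json.dumps({"index": {}})
        # json.dumps without `indent` always emits a single line.
        yield json.dumps(doc)


def bulk_bodies(docs_in, docs_per_request=1000):
    """Group the line pairs into request bodies, each terminated by a newline.

    `docs_per_request` is a hypothetical tuning knob, not a recommended value.
    """
    batch = []
    for text in bulk_lines(docs_in):
        batch.append(text)
        if len(batch) == 2 * docs_per_request:  # two lines per document
            yield "\n".join(batch) + "\n"
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"


# Made-up sample input with three documents, two documents per request.
sample = [
    '{"asin": "0001048791", "title": "A"}\n',
    '{"asin": "0000143561", "price": 12.99}\n',
    '{"asin": "0000037214"}\n',
]
bodies = list(bulk_bodies(sample, docs_per_request=2))
for body in bodies:
    print(body, end="")
```

Each yielded body ends with a newline, which the bulk format expects after the last line, so a body can be written to a file or posted as-is.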
