Python JSON to CSV: reading line by line is slow
I have a large JSON file, tens of GB. I read it line by line, but writing is slow: about 10 MB per minute. Please help me modify the code.
This is my code:
import pandas as pd
import json

add_header = True
with open('1.json') as f_json:
    for line in f_json:
        line = line.strip()
        df = pd.json_normalize(json.loads(line))
        df.to_csv('1.csv', index=None, mode='a', header=add_header)
        add_header = False
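Much of the cost in the loop above comes from calling pd.json_normalize and to_csv once per line, and from reopening the CSV in append mode on every row. Batching many lines into a single normalize-and-write step usually helps a lot. A minimal sketch, where jsonl_to_csv and batch_size are illustrative names and the code assumes every line shares a compatible schema:

```python
import json
import pandas as pd

def jsonl_to_csv(src, dst, batch_size=10000):
    """Convert a JSON-lines file to CSV, normalizing one batch at a time."""
    header = True
    batch = []
    with open(src, encoding='utf8') as f_json, \
         open(dst, 'w', newline='', encoding='utf8') as f_csv:
        for line in f_json:
            line = line.strip()
            if not line:
                continue
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                # One normalize + one write per batch instead of per line.
                pd.json_normalize(batch).to_csv(f_csv, index=False, header=header)
                header = False
                batch.clear()
        if batch:  # flush the remainder
            pd.json_normalize(batch).to_csv(f_csv, index=False, header=header)
```

Keeping the output file open for the whole run also avoids the repeated open/close that mode='a' incurs.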
I also tried chunked reading, but got an error. The code:
import pandas as pd
import json

data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=100000)
for df in data:
    line = df.strip()
    df = pd.json_normalize(json.loads(line))
    df.to_csv('G:\\1.csv', index=None, mode='a', encoding='utf8')
Output:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'strip'
Process finished with exit code -1
Here is my JSON file:
{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18354522225","b_email":"18354522225@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18694982966","b_email":"xyw_747@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
chunksize returns an iterator of DataFrames, so you cannot call strip or json.loads on it.
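To see what the chunked reader actually yields, here is a small sketch using an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# A tiny stand-in for the real file: three JSON lines in memory.
jsonl = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')

# With lines=True and chunksize, read_json yields DataFrames,
# so each chunk is already parsed: no strip() or json.loads() needed.
chunks = list(pd.read_json(jsonl, lines=True, chunksize=2))
```

Calling df.strip() on one of these chunks fails exactly as in the traceback above, because DataFrame has no such method.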
You would probably need:
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([subdf[['_index', '_id', '_score']],
                         pd.json_normalize(subdf._source)],
                        axis=1)
    temp_df.to_csv(filename, index=None, mode='a', encoding='utf8')
You can modify the pd.concat line to flatten or extract the data you want, but I hope you get the idea.
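One caveat to hedge: as far as I understand, chunks after the first keep a running row index, while pd.json_normalize always starts from 0, so the axis=1 concat can misalign and produce NaN-padded rows. Resetting the index on the left side avoids that. A small self-contained sketch, with values made up from the sample records above:

```python
import pandas as pd

# A chunk as it might arrive from read_json: note the non-zero index.
subdf = pd.DataFrame(
    {'_index': ['core-bvd-dmc'], '_id': ['e227'], '_score': [1],
     '_source': [{'a_id': 'P305906272', 'b_city': 'Yantai'}]},
    index=[100000],
)

# Align both halves on a fresh 0-based index before concatenating.
flat = pd.concat(
    [subdf[['_index', '_id', '_score']].reset_index(drop=True),
     pd.json_normalize(subdf['_source'].tolist())],
    axis=1,
)
```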
Another thought: although CSV handles large data better than JSON, would you consider chunking the output CSV into multiple files instead of creating one huge CSV?
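If you go that route, here is a sketch of writing one output file per chunk; write_chunked_csv and the file-name pattern are illustrative, not part of pandas:

```python
import pandas as pd

def write_chunked_csv(reader, out_pattern='out_{:04d}.csv'):
    """Write each chunk from a pandas JSON reader to its own numbered CSV."""
    paths = []
    for i, subdf in enumerate(reader):
        path = out_pattern.format(i)
        subdf.to_csv(path, index=False)  # each file gets its own header
        paths.append(path)
    return paths
```

Separate files also sidestep the append-mode header bookkeeping, since every part is written in one shot.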