
python json to csv: reading line by line is slow

I have a large JSON file, tens of GB in size. I read the file line by line, but writing is slow, about 10 MB per minute. Please help me improve the code.

This is my code:



import pandas as pd
import json

add_header = True  # write the CSV header only on the first append

with open('1.json', encoding='utf8') as f_json:
    for line in f_json:
        line = line.strip()

        # builds a one-row DataFrame and appends to the CSV for every input line
        df = pd.json_normalize(json.loads(line))
        df.to_csv('1.csv', index=False, mode='a', header=add_header)
        add_header = False
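
Most of the time here likely goes into constructing a one-row DataFrame and reopening the CSV in append mode once per input line. A minimal sketch that batches lines before writing (the batch size of 10,000 is an assumption to tune for your memory budget):

import json
import pandas as pd

BATCH = 10_000  # rows per write; an assumption, tune as needed
add_header = True
records = []

with open('1.json', encoding='utf8') as f_json:
    for line in f_json:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
        if len(records) >= BATCH:
            # one json_normalize and one file append per batch, not per line
            pd.json_normalize(records).to_csv('1.csv', index=False, mode='a', header=add_header)
            add_header = False
            records = []

if records:  # flush the final partial batch
    pd.json_normalize(records).to_csv('1.csv', index=False, mode='a', header=add_header)

Note that json_normalize must yield the same columns for every batch, otherwise the appended CSV will not stay aligned.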

I also tried chunked reading, but got an error. The code:

import pandas as pd
import json

data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=100000)
for df in data:
    line = df.strip()  # fails here: df is a DataFrame, not a string
    df = pd.json_normalize(json.loads(line))
    df.to_csv('G:\\1.csv', index=None, mode='a', encoding='utf8')

Output:

    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'strip'
Process finished with exit code -1

Here is my JSON file:

{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18354522225","b_email":"18354522225@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":"+86 18694982966","b_email":"xyw_747@163.com","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}

With chunksize, pd.read_json returns an iterator of DataFrames, so you cannot call strip or json.loads on the chunks.

You would probably need something like:

add_header = True
for subdf in data:
    # subdf is already a DataFrame; flatten the nested _source dicts
    flat = pd.json_normalize(subdf['_source'].tolist())
    # reset_index so the chunk's row labels line up with flat's 0-based index
    temp_df = pd.concat([subdf[['_index', '_id', '_score']].reset_index(drop=True), flat], axis=1)
    temp_df.to_csv(filename, index=False, mode='a', encoding='utf8', header=add_header)
    add_header = False

You can modify the pd.concat line to flatten/extract the fields you want, but I hope you get the idea.

Another thought: although CSV holds large data better than JSON, would you consider chunking the output into multiple CSV files instead of creating one huge CSV?
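
If you go that way, here is a minimal sketch building on the loop above (the part-file naming scheme is an assumption):

import pandas as pd

data = pd.read_json('G:\\1.json', encoding='utf8', lines=True, chunksize=100000)

for i, subdf in enumerate(data):
    flat = pd.json_normalize(subdf['_source'].tolist())
    out = pd.concat([subdf[['_index', '_id', '_score']].reset_index(drop=True), flat], axis=1)
    # one self-contained CSV per chunk: G:\1_part000.csv, G:\1_part001.csv, ...
    out.to_csv(f'G:\\1_part{i:03d}.csv', index=False, encoding='utf8')

Each part file then carries its own header, which removes the append/header bookkeeping entirely.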
