
JSON to CSV conversion taking a very long time on large files in Python

I am trying to convert a very large JSON file to CSV. The code works well with smaller files, but it takes a very long time on larger ones: I first tested it on a 91 MB file containing 80,000 entries, which took around 45 minutes, but a bigger file containing 300,000 entries took around 5 hours. Is there some way to do this through multiprocessing? I am a beginner Python programmer, so I have no idea how to use multiprocessing or multithreading in Python. Here is my code:

import json
import time
import pandas as pd

csv_project = pd.DataFrame([], columns=['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])

with open('test.json', 'r') as f:
    data = f.readlines()

j = 0
for k, i in enumerate(data):
    # Only parse lines that look like complete JSON objects.
    if '{' in i and '}' in i:
        j += 1
        dictionary = json.loads(i)
        # append() copies the whole DataFrame on every call,
        # which is what makes this loop slow on large files.
        csv_project = csv_project.append(dictionary, ignore_index=True)
    else:
        pass
    # Every 10,000 entries, report progress and rewrite the CSV so far.
    if j == 10000:
        print(str(k) + ' number of entries done')
        csv_project.to_csv('data.csv')
        j = 0
csv_project.to_csv('data.csv')

Any useful help will be appreciated. Edit: here is a sample of the JSON format.

    {"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.", 
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"], 
"n_citation": 0,
 "references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
 "title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
 "venue": "high performance computing and communications",
 "year": 2016,
 "id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}


{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
 "authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"], 
"n_citation": 0, 
"references": [], 
"title": "A novel conformal jigsaw EBG structure design", 
"venue": "international conference on conceptual structures", 
"year": 2016, 
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}

You keep all your data in memory, once as lines and once as a DataFrame. This can slow down your processing.

Using the csv module would allow you to process the file in streaming mode:

import json
import csv

with open('test.json') as lines, open('data.csv', 'w', newline='') as output:
    writer = csv.DictWriter(output, ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
    writer.writeheader()
    for line in lines:
        line = line.strip()
        # startswith/endswith also handle blank lines without raising IndexError.
        if line.startswith('{') and line.endswith('}'):
            writer.writerow(json.loads(line))
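
One caveat with this streaming approach: csv.DictWriter writes list values such as authors and references using their Python repr. If you want plainer cells, here is a minimal sketch of the same idea that flattens lists first (assuming a semicolon-separated string is an acceptable cell format; flatten and FIELDS are just illustrative names):

import json
import csv

FIELDS = ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id']

def flatten(record):
    # Join list values into one delimited string; leave scalars untouched.
    return {key: '; '.join(value) if isinstance(value, list) else value
            for key, value in record.items()}

with open('test.json') as lines, open('data.csv', 'w', newline='') as output:
    writer = csv.DictWriter(output, FIELDS)
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):
            writer.writerow(flatten(json.loads(line)))

The important part is converting list values to strings before handing each record to writerow.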

It seems you're reading a JSON Lines file, which might look something like this:

{key1: value1, key2: [value2, value3, value4], key3: value3}
{key1: value4, key2: [value5, value6], key3: value7}

Notice that there are no commas at the end of the lines, and each line is itself valid JSON.

Luckily for you, pandas can read a JSON Lines file directly, like this:

pd.read_json('test.json', lines=True)

Since your column names are exactly the same as your JSON keys, there's no need to set up a blank DataFrame ahead of time; read_json will do all the parsing for you. Example:

df = pd.read_json('test.json', lines=True)
print(df)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Even luckier, if you are limited by memory, there is a chunksize argument you can use, which turns the .read_json method into a generator:

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

Now when you iterate through json_reader, each iteration will output a DataFrame of the next 10,000 rows from the JSON file. Example:

for j in json_reader:
  print(j)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
2  AdaBoost algorithm based on Haar-like features...  ...   2016
3  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
4  AdaBoost algorithm based on Haar-like features...  ...   2016
5  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Combining all this newfound knowledge, you can use chunksize=10000 and output each chunked DataFrame as a separate CSV, like so:

for i, df in enumerate(json_reader):
  df.to_csv('my_csv_file_{}'.format(i))

Notice that I used the enumerate() function to get an auto-incrementing index number, and str.format() to append that number to each generated CSV filename.
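
If you would rather end up with a single CSV instead of one file per chunk, here is a minimal sketch of the same chunked approach (assuming data.csv does not already exist, since mode='a' appends to any existing file):

import pandas as pd

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

for i, df in enumerate(json_reader):
    # Write the header with the first chunk only, then append the rest.
    df.to_csv('data.csv', mode='a', header=(i == 0), index=False)

This keeps only one chunk of 10,000 rows in memory at a time, which is what makes it practical for files much larger than RAM.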

You can see an example here on Repl.it.
