JSON to CSV conversion taking a very long time on large files in Python
I am trying to convert a very large JSON file to CSV. The code works well on smaller files, but takes a very long time on larger ones: I first tested it on a 91 MB file containing 80,000 entries and it took around 45 minutes, but a bigger file containing 300,000 entries took around 5 hours. Is there some way to do this with multiprocessing? I am a beginner Python programmer, so I have no idea how to use multiprocessing or multithreading in Python. Here is my code:
import json
import time
import pandas as pd

csv_project = pd.DataFrame([], columns=['abstract','authors','n_citation',"references","title","venue","year",'id'])

with open('test.json','r') as f:
    data = f.readlines()

j = 0
for k, i in enumerate(data):
    if '{' in i and '}' in i:
        j += 1
        dictionary = json.loads(i)
        csv_project = csv_project.append(dictionary, ignore_index=True)
    else:
        pass
    if j == 10000:
        print(str(k) + ' number of entries done')
        csv_project.to_csv('data.csv')
        j = 0
csv_project.to_csv('data.csv')
Any useful help will be appreciated.

Edit: here is the sample JSON format:
{"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.",
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"],
"n_citation": 0,
"references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
"title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
"venue": "high performance computing and communications",
"year": 2016,
"id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}
{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
"authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"],
"n_citation": 0,
"references": [],
"title": "A novel conformal jigsaw EBG structure design",
"venue": "international conference on conceptual structures",
"year": 2016,
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}
You keep all your data in memory, once as lines and once as a dataframe. This could slow down your processing.

Using the csv module would allow you to process the file in streaming mode:
import json
import csv

with open('test.json') as lines, open('data.csv', 'w', newline='') as output:
    output = csv.DictWriter(output, ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
    output.writeheader()
    for line in lines:
        line = line.strip()
        # str.startswith/endswith also handles blank lines safely,
        # where indexing line[0] would raise an IndexError
        if line.startswith('{') and line.endswith('}'):
            output.writerow(json.loads(line))
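One caveat worth noting with this streaming approach (my addition, not part of the original answer): csv.DictWriter writes list-valued fields such as authors and references using Python's repr. If you want clean JSON strings in those cells instead, you can serialize them explicitly. A minimal sketch, where row_from_record is a hypothetical helper name:

```python
import csv
import io
import json

FIELDS = ['abstract', 'authors', 'n_citation', 'references',
          'title', 'venue', 'year', 'id']

def row_from_record(record):
    # Serialize list-valued fields as JSON strings so the CSV cells
    # are unambiguous, instead of Python repr like "['a', 'b']".
    return {k: json.dumps(v) if isinstance(v, list) else v
            for k, v in record.items()}

# Demonstrate on an in-memory buffer with one sample record.
buf = io.StringIO()
writer = csv.DictWriter(buf, FIELDS)
writer.writeheader()
writer.writerow(row_from_record(
    {'abstract': 'demo', 'authors': ['A', 'B'], 'n_citation': 0,
     'references': [], 'title': 't', 'venue': 'v', 'year': 2016, 'id': 'x'}))
print(buf.getvalue())
```

The authors cell now holds the JSON string ["A", "B"] rather than a Python list repr, which is easier to parse back later.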
It seems you're reading a json lines file, which might look something like this:

{key1: value1, key2: [value2, value3, value4], key3: value3}
{key1: value4, key2: [value5, value6], key3: value7}

Notice there are no commas at the end, and each line by itself is valid json.
Lucky for you, pandas can read a json lines file directly like this:

pd.read_json('test.json', lines=True)

Since your column names are exactly the same as your json keys, there's no need to set up a blank DataFrame ahead of time. read_json will do all the parsing for you. Example:
df = pd.read_json('test.json', lines=True)
print(df)
abstract ... year
0 AdaBoost algorithm based on Haar-like features... ... 2016
1 In this paper, a kind of novel jigsaw EBG stru... ... 2016
[2 rows x 8 columns]
Even luckier, if you are limited by size, there is a chunksize argument you can use which turns read_json into a generator:
json_reader = pd.read_json('test.json', lines=True, chunksize=10000)
Now when you iterate through json_reader, each iteration yields a DataFrame holding the next 10,000 rows from the json file. Example:
for j in json_reader:
    print(j)
abstract ... year
0 AdaBoost algorithm based on Haar-like features... ... 2016
1 In this paper, a kind of novel jigsaw EBG stru... ... 2016
[2 rows x 8 columns]
abstract ... year
2 AdaBoost algorithm based on Haar-like features... ... 2016
3 In this paper, a kind of novel jigsaw EBG stru... ... 2016
[2 rows x 8 columns]
abstract ... year
4 AdaBoost algorithm based on Haar-like features... ... 2016
5 In this paper, a kind of novel jigsaw EBG stru... ... 2016
[2 rows x 8 columns]
Combining all this newfound knowledge, you can use chunksize=10000 and write each chunked DataFrame out as a separate csv like so:

for i, df in enumerate(json_reader):
    df.to_csv('my_csv_file_{}'.format(i))
Here you'll notice I combined the enumerate() function, so we get an auto-incrementing index number, with the str.format() function, to append that index number to each generated csv file name.
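If you would rather end up with a single CSV instead of one file per chunk, a variant of the chunked approach (my sketch, not part of the original answer) is to append every chunk to the same file and write the header row only once, via to_csv's mode and header parameters:

```python
import json
import pandas as pd

# Build a small json-lines file so this sketch is self-contained;
# the field names here are illustrative, not the full schema.
records = [{'title': 'paper %d' % n, 'year': 2016, 'n_citation': n}
           for n in range(5)]
with open('test.json', 'w') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')

# Stream the file in chunks; the first chunk creates the file with a
# header, every later chunk appends without repeating the header.
json_reader = pd.read_json('test.json', lines=True, chunksize=2)
for i, chunk in enumerate(json_reader):
    chunk.to_csv('data.csv',
                 mode='w' if i == 0 else 'a',
                 header=(i == 0),
                 index=False)

print(sum(1 for _ in open('data.csv')))  # prints 6: header + 5 data rows
```

Memory use stays bounded by the chunk size, since only one chunk is in memory at a time.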