How to convert a very large JSON response from a web service into CSV using Python?
I call a web service that returns a very large JSON response. I want to parse it and convert it to CSV format using Python. I have written code that loads the JSON and converts it to CSV. However, for a large response it raises MemoryError. How can I load and convert the response data using streaming?
Here is my code:
import json
import requests
from pandas import json_normalize
re = requests.get(url)
data = json.loads(re.text)
df = json_normalize(data)
df.to_csv(fileName, index=False, encoding='utf-8')
Here is a sample of my JSON response:
[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},
{"F1":"V2_1","F2":true,,"F3":450,"F4":"URL2","F5":{"F5_1":13,"F5_2":"B"}},
{"F1":"V3_1","F2":false,,"F3":312,"F4":"URL3","F5":{"F5_1":6,"F5_2":"C"}},
...
]
The MemoryError occurs in the json.loads() function. I also tested the following Python code:
import pandas as pd
import requests
response = requests.get(url)
data = response.json()
df = pd.json_normalize(data)
df.to_csv("filename.csv", index=False, encoding="utf-8")
But there is still a MemoryError in the response.json() call. Is there any way to load, parse, and convert such a big JSON response to a CSV file?
There is no well-known or "the best" way to handle very large JSON files.
However, the requests library provides a way to stream results line by line, and by modifying the lines it might be possible to achieve your task.
The algorithm is simple: remove the opening [, replace the separator },\n{ with }\n{, and delete the closing ]. For your example you'll also need to replace the double comma ,, with a single one ,. At the end you should have code resembling the following:
import json
import pandas as pd
import requests

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of the cleanup goes here
    return json.loads(pl)

def run():
    r = requests.get(url, stream=True)
    res = []
    for line in r.iter_lines():
        # filter out keep-alive new lines
        if line:
            jso = decode_record(line)
            # You might also want to stream lines directly to the CSV file here,
            # to avoid allocating the DataFrame
            res.append(jso)
    df = pd.DataFrame(res)
    # Parsing of the F5 field may be better performed with pandas functions
    # because it's still a complex object
    print(df.info())
    df.to_csv(filename, index=False, encoding='utf-8')
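The cleanup step can be checked offline before pointing it at the real service. Below is a minimal sketch that only trims the ends of each line instead of chaining replace() calls, and skips lines that carry no record (such as a closing ] on its own line). The name parse_line and the sample bytes are my own, built from the sample in the question; it assumes each record sits on exactly one line.

```python
import json

def parse_line(raw: bytes):
    """Turn one line of the streamed array into a dict, or None if it holds no record."""
    text = raw.decode('utf-8').strip()
    # Drop the opening array bracket, if this is the first line.
    if text.startswith('['):
        text = text[1:]
    # Drop the trailing record separator and/or the closing array bracket.
    text = text.rstrip(',]').strip()
    if not text.startswith('{'):
        return None
    # The sample in the question contains a doubled comma; collapse it.
    # Caveat: this would also alter a string value that contains ",,".
    text = text.replace(',,', ',')
    return json.loads(text)

sample = [
    b'[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},',
    b'{"F1":"V2_1","F2":true,,"F3":450,"F4":"URL2","F5":{"F5_1":13,"F5_2":"B"}},',
    b'{"F1":"V3_1","F2":false,,"F3":312,"F4":"URL3","F5":{"F5_1":6,"F5_2":"C"}}',
    b']',
]
records = [rec for rec in map(parse_line, sample) if rec is not None]
print(len(records), records[0]['F5'])
```

If the service ever splits a record across several lines, this line-based approach will not work and an incremental JSON parser would be needed instead.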
A variation without a DataFrame:
import json
import requests

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of the cleanup goes here
    return json.loads(pl)

def encode_csv_record(jso):
    # Naive encoding: values are not quoted or escaped, and the nested F5
    # dict is written as its Python repr
    res = []
    for k, v in jso.items():
        res.append(str(v))
    return ','.join(res)

def run():
    r = requests.get(url, stream=True)
    with open(filename, 'w') as csvout:
        for line in r.iter_lines():
            # filter out keep-alive new lines
            if line:
                jso = decode_record(line)
                csv_line = encode_csv_record(jso)
                # write() plus an explicit newline, so each record gets its own row
                csvout.write(csv_line + '\n')
Of course, this answer has some gaps, but it should convey the idea.
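One of those gaps is the CSV encoding itself: joining str(v) with commas breaks as soon as a value contains a comma, and the nested F5 object does not get its own columns. A possible way around that, sketched with the standard csv module, is shown below. The helper names flatten and write_csv are my own, and the sketch assumes the first record carries every column that later records use.

```python
import csv
import io

def flatten(record, prefix=''):
    """Flatten nested dicts into one level, joining keys with '.'."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

def write_csv(records, out):
    """Write an iterable of dicts to `out` one row at a time; return the row count."""
    writer = None
    count = 0
    for record in records:
        row = flatten(record)
        if writer is None:
            # Assumption: the first record defines the full set of columns.
            # Keys that only appear later are dropped by extrasaction='ignore'.
            writer = csv.DictWriter(out, fieldnames=list(row), extrasaction='ignore')
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count

buf = io.StringIO()
rows_written = write_csv(
    [{"F1": "V1_1", "F2": False, "F3": 120, "F4": "URL1", "F5": {"F5_1": 4, "F5_2": "A"}}],
    buf,
)
print(buf.getvalue())
```

Because write_csv accepts any iterable, it can be fed the decoded records one by one from the streaming loop, so only a single record is held in memory at a time. When writing to a real file, open it with newline='' as the csv module documentation recommends.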
This may also raise a memory error, but it is a simpler approach.
import requests
# Get JSON data
re = requests.get(url)
# Write the response text to the .csv file as-is
with open(fileName, "w") as f:
    f.write(re.text)
Using the yield statement instead of return would allow you to hand back one unit of work at a time.
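As a rough illustration of that idea, here is a sketch of a generator that yields one parsed record per line rather than building a list. The name iter_records and the shortened sample lines are made up for the example (the doubled comma from the question is left out to keep it short). With requests, the same function should accept requests.get(url, stream=True).iter_lines(), which yields bytes.

```python
import json
from itertools import islice

def iter_records(lines):
    """Yield one dict per line that contains a record; nothing is accumulated."""
    for raw in lines:
        # Trim the array brackets and the trailing separator from the line ends.
        text = raw.decode('utf-8').strip().lstrip('[').rstrip(',]')
        if text.startswith('{'):
            yield json.loads(text)

lines = [
    b'[{"F1":"V1_1","F3":120},',
    b'{"F1":"V2_1","F3":450},',
    b'{"F1":"V3_1","F3":312}',
    b']',
]
# Only the first two lines are parsed here; the rest are never touched.
first_two = list(islice(iter_records(lines), 2))
print(first_two)
```

The caller decides how many records to pull, so memory use stays at roughly one record regardless of the size of the response.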