How to convert a very large JSON response from a web service to CSV using Python?

Question

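I am calling a web service that returns a very large JSON response. I want to parse it and convert it to CSV with Python. I wrote code that loads the JSON and converts it to CSV, but for large responses it raises a MemoryError. How can I load and convert the response data as a stream?

Here is my code: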

import json
import requests
from pandas import json_normalize
re = requests.get(url)
data = json.loads(re.text)
df = json_normalize(data)
df.to_csv(fileName, index=False, encoding='utf-8')

Here is a sample of my JSON response:

[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},
{"F1":"V2_1","F2":true,,"F3":450,"F4":"URL2","F5":{"F5_1":13,"F5_2":"B"}},
{"F1":"V3_1","F2":false,,"F3":312,"F4":"URL3","F5":{"F5_1":6,"F5_2":"C"}},
...
]

The MemoryError occurs in the json.loads() call. I also tested the following Python code:

import pandas as pd
import requests
response = requests.get(url)
data = response.json()
df = pd.json_normalize(data)
df.to_csv("filename.csv", index=False, encoding="utf-8")

But response.json() still raises a MemoryError. Any ideas on how to load and parse such a large JSON response and convert it to a CSV file?

Answer 1

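There is no well-known or "best" way to handle very large JSON files.

However, the requests library offers a way to stream the result line by line, and you can accomplish your task by modifying the lines as they arrive.

The algorithm is simple: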

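Iterate over the lines and parse each line as JSON, as in the streaming example in the requests documentation.
Replace the "special" tokens in the stream so that the JSON parser can parse each record without trouble: remove the opening [, replace the separator },\n{ with }\n{, and remove the closing ]. For your sample you also need to replace the double comma ,, with a single one ,.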
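The cleanup step described above can be tried on the sample lines without touching the network. The following is only a minimal sketch: the helper name clean_line is hypothetical, and it assumes one record per line, that every record is a JSON object, and that ",," never occurs inside a string value.

```python
import json
from typing import Optional


def clean_line(raw: bytes) -> Optional[dict]:
    """Turn one raw line of the streamed array into a dict, or None if it holds no record."""
    text = raw.decode("utf-8").strip()
    # Drop the array's opening "[" and the trailing "," separator or closing "]".
    text = text.lstrip("[").rstrip(",]").strip()
    if not text:
        # The line held only a bracket (for example a lone "]").
        return None
    # Assumption: the doubled comma is always a glitch between fields,
    # never part of a string value.
    text = text.replace(",,", ",")
    return json.loads(text)


first = clean_line(b'[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},')
print(first["F1"], first["F5"]["F5_2"])
```

Stripping characters from the ends of the line, rather than replacing substrings such as '}},', also copes with a closing ] that sits on its own line.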

In the end, you should end up with code similar to the following:

import json

import pandas as pd
import requests

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def run():
    r = requests.get(url, stream=True)
    res = []
    for line in r.iter_lines():

        # filter out keep-alive new lines
        if line:
            jso = decode_record(line)
            # You might also want to stream lines directly to CSV file here,
            # just not to allocate the DataFrame
            res.append(jso)
    df = pd.DataFrame(res)
    # Parsing of the F5 field may be better done with pandas functions,
    # because it is still a nested object at this point
    print(df.info())
    df.to_csv(filename, index=False, encoding='utf-8')

A variation without the DataFrame:

import csv
import json

import requests

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def encode_csv_record(jso):
    # One CSV cell per field, in the record's key order
    return [str(v) for v in jso.values()]


def run():
    r = requests.get(url, stream=True)
    # newline='' lets the csv module control the line endings
    with open(filename, 'w', newline='', encoding='utf-8') as csvout:
        # csv.writer quotes cells that contain commas (e.g. the nested F5 object)
        writer = csv.writer(csvout)
        for line in r.iter_lines():
            # filter out keep-alive new lines
            if line:
                jso = decode_record(line)
                # writerow ends each record with a line break
                writer.writerow(encode_csv_record(jso))

This answer has some gaps, but it should get the idea across.
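One of those gaps is the nested F5 field, which is still an object when the record is written out. As a rough sketch (not part of the original answer), the records could be flattened and written with csv.DictWriter from the standard library. The helper names flatten and write_csv are hypothetical, and the sketch assumes every record has the same fields as the first one. The dotted column names mirror what json_normalize produces by default.

```python
import csv
import io
from typing import Any, Dict, Iterable


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level, e.g. {"F5": {"F5_1": 4}} -> {"F5.F5_1": 4}."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def write_csv(records: Iterable[Dict[str, Any]], out) -> int:
    """Write records to an open text file one row at a time; return the row count."""
    writer = None
    count = 0
    for record in records:
        row = flatten(record)
        if writer is None:
            # Assumption: the first record defines the columns. DictWriter raises
            # ValueError for unexpected keys and leaves missing ones empty.
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


buf = io.StringIO()
sample = [{"F1": "V1_1", "F2": False, "F3": 120, "F4": "URL1", "F5": {"F5_1": 4, "F5_2": "A"}}]
rows_written = write_csv(sample, buf)
print(buf.getvalue())
```

Because write_csv accepts any iterable, it can be fed records as they are decoded from the stream, so only one row is held in memory at a time. With a real file, open it with newline='' as the csv module documentation recommends.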

Answer 2

This may also cause a memory error, but it is a simpler approach.

import requests

# Get JSON Data
re = requests.get(url)

# Write the raw response text to the file (no JSON-to-CSV conversion happens here)
with open(fileName, "w", encoding="utf-8") as f:
    f.write(re.text)

Using the yield statement instead of return lets you hand back the data one unit of work at a time.
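To illustrate the yield idea, here is a small sketch of a generator that hands records back in fixed-size batches instead of returning one large list. The function name in_batches and the batch size are assumptions made for the example; any iterable of parsed records can be passed in.

```python
from typing import Iterable, Iterator, List


def in_batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records, so the caller never holds the full data set."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            # Start a new list so the batch already handed out is not modified.
            batch = []
    if batch:
        # Hand back the final, possibly shorter, batch.
        yield batch


records = ({"id": i} for i in range(5))
batch_sizes = [len(batch) for batch in in_batches(records, 2)]
print(batch_sizes)
```

Each batch can then be converted and appended to the CSV file before the next one is read.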


2 answers

Answer 1
1 accepted 2020-02-17 08:51:24

Answer 2
0 2020-10-07 04:39:32
