
How to convert a very large JSON response from a web service into CSV using Python?

I call a web service that returns a very large JSON response. I want to parse it and convert it into CSV format using Python. I have written code that loads the JSON and converts it to CSV. However, for a large response it raises a MemoryError. How can I load and convert the response data using streaming?

Here is my code:

import requests
import json
from pandas import json_normalize

re = requests.get(url)
data = json.loads(re.text)
df = json_normalize(data)
df.to_csv(fileName, index=False, encoding='utf-8')

Here is a sample of my JSON response:

[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},
{"F1":"V2_1","F2":true,,"F3":450,"F4":"URL2","F5":{"F5_1":13,"F5_2":"B"}},
{"F1":"V3_1","F2":false,,"F3":312,"F4":"URL3","F5":{"F5_1":6,"F5_2":"C"}},
...
]

The MemoryError occurs in the json.loads() function. I also tested the following Python code:

import requests
import pandas as pd

response = requests.get(url)
data = response.json()
df = pd.json_normalize(data)
df.to_csv("filename.csv", index=False, encoding="utf-8")

But there is still a MemoryError, this time in the response.json() call. Is there any way I can load, parse, and convert such a big JSON response to a CSV file?

There is no well-known or "best" way to handle very large JSON files.

However, the requests library provides a way to stream results line by line, and with some modification of those lines it should be possible to achieve your task.

The algorithm is simple:

  • Iterate over the lines, parsing every single line as JSON, as in the streaming-results example in the requests library.
  • Replace the "special" markers in the data stream so the JSON parser can parse each record without problems. Markers: delete the opening [ , replace the separator },\n{ with }\n{ , and delete the closing ] . For your example you'll also need to replace the double comma ,, with a single one , .

At the end you should arrive at code resembling the following:

import requests
import json
import pandas as pd

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    # Strip the array markers, drop the trailing comma after each
    # record, and collapse the doubled commas
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def run():
    r = requests.get(url, stream=True)
    res = []
    for line in r.iter_lines():
        # filter out keep-alive new lines
        if line:
            jso = decode_record(line)
            # You might also want to stream lines directly to a CSV file here,
            # simply to avoid allocating the whole DataFrame
            res.append(jso)
    df = pd.DataFrame(res)
    # Parsing of the F5 field may be better performed with Pandas functions
    # because it's still a complex object
    print(df.info())
    df.to_csv(filename, index=False, encoding='utf-8')
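Since F5 is still a nested object, pd.json_normalize (the same function used in the question) can flatten it into separate columns. A minimal sketch, assuming records shaped like the sample above:

import pandas as pd

# Two sample records shaped like the question's data; F5 is a nested dict
res = [
    {"F1": "V1_1", "F2": False, "F3": 120, "F4": "URL1", "F5": {"F5_1": 4, "F5_2": "A"}},
    {"F1": "V2_1", "F2": True, "F3": 450, "F4": "URL2", "F5": {"F5_1": 13, "F5_2": "B"}},
]

# json_normalize flattens nested dicts into dotted column names:
# F1, F2, F3, F4, F5.F5_1, F5.F5_2
df = pd.json_normalize(res)
print(df.columns.tolist())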

Here is a variation without a DataFrame:

import requests
import json

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    # Strip the array markers, drop the trailing comma after each
    # record, and collapse the doubled commas
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def encode_csv_record(jso):
    # Naive CSV encoding: stringify each value and join with commas.
    # Note: values are not quoted, and the nested F5 object is written
    # as its Python repr
    res = []
    for k, v in jso.items():
        res.append(str(v))
    return ','.join(res)


def run():
    r = requests.get(url, stream=True)
    with open(filename, 'w') as csvout:
        for line in r.iter_lines():
            # filter out keep-alive new lines
            if line:
                jso = decode_record(line)
                csv_line = encode_csv_record(jso)
                csvout.write(csv_line + '\n')  # one CSV row per record
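If field values can themselves contain commas or quotes, the standard-library csv module handles quoting for you. A minimal sketch of a replacement writer (write_records is a hypothetical helper; it assumes every record has the same keys in the same order):

import csv

def write_records(records, filename):
    # csv.DictWriter takes care of quoting and line endings;
    # assumes all records share the same keys in the same order
    with open(filename, 'w', newline='') as csvout:
        writer = None
        for jso in records:
            if writer is None:
                writer = csv.DictWriter(csvout, fieldnames=list(jso.keys()))
                writer.writeheader()
            writer.writerow(jso)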

Of course this answer has some gaps, but it should convey the idea.

This may also get a memory error, but it is a simplified approach.

import requests

# Get JSON data
re = requests.get(url)

# Write the response text straight to the file
with open(fileName, "w") as f:
    f.write(re.text)
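To avoid holding the whole response body in memory even for this write-through approach, requests can stream the body to disk in chunks. A sketch, assuming the same url and fileName:

import requests

# Stream the body to disk in fixed-size chunks instead of
# materializing it as one string via re.text
with requests.get(url, stream=True) as r:
    with open(fileName, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)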

Using the yield keyword instead of a return statement would allow you to return units of work one at a time.
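For example, a generator (parse_records below is a hypothetical name) can yield one decoded record at a time, building on decode_record from the earlier answer, so the caller never holds the full list in memory:

import requests

def parse_records(url):
    # Yield one decoded record per line instead of
    # returning the whole list at once
    r = requests.get(url, stream=True)
    for line in r.iter_lines():
        if line:
            yield decode_record(line)  # decode_record as defined above

# Consume one record at a time
for record in parse_records(url):
    print(record)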
