
How to convert a very large JSON response from a web service into CSV using Python?

I call a web service that returns a very large JSON response. I want to parse it and convert it into CSV format using Python. I have written code that loads the JSON and converts it to CSV. However, for a large response it raises a MemoryError. How can I load and convert the response data using streaming?

Here is my code:

import requests
import json
from pandas import json_normalize

re = requests.get(url)
data = json.loads(re.text)
df = json_normalize(data)
df.to_csv(fileName, index=False, encoding='utf-8')

Here is a sample of my JSON response:

[{"F1":"V1_1","F2":false,,"F3":120,"F4":"URL1","F5":{"F5_1":4,"F5_2":"A"}},
{"F1":"V2_1","F2":true,,"F3":450,"F4":"URL2","F5":{"F5_1":13,"F5_2":"B"}},
{"F1":"V3_1","F2":false,,"F3":312,"F4":"URL3","F5":{"F5_1":6,"F5_2":"C"}},
...
]

The MemoryError occurs in the json.loads() function. I also tested the following Python code:

import requests
import pandas as pd

response = requests.get(url)
data = response.json()
df = pd.json_normalize(data)
df.to_csv("filename.csv", index=False, encoding="utf-8")

But there is still a MemoryError, this time in the response.json() call. Is there any way I can load, parse, and convert such a big JSON response to a CSV file?

There is no well-known or "best" way to handle very large JSON files.

However, the requests library provides a way to stream results line by line, and with some modification of those lines it should be possible to achieve your task.

The algorithm is simple:

  • Iterate over the lines, parsing every single line as JSON, as in the streaming-results example in the requests library.
  • Replace the "special" markers in the data stream so the JSON parser can parse each record without problems. Markers: delete the opening [ , replace the separator },\n{ with }\n{ , and delete the closing ] . For your example you'll also need to replace the double comma ,, with a single one , .

At the end you should arrive at code resembling the following:

import requests
import json
import pandas as pd

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    # Strip the array markers, drop the trailing comma after each
    # record, and collapse the doubled commas
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def run():
    r = requests.get(url, stream=True)
    res = []
    for line in r.iter_lines():
        # filter out keep-alive new lines
        if line:
            jso = decode_record(line)
            # You might also want to stream lines directly to a CSV file here,
            # simply to avoid allocating the whole DataFrame
            res.append(jso)
    df = pd.DataFrame(res)
    # Parsing of the F5 field may be better performed with Pandas functions
    # because it's still a complex object
    print(df.info())
    df.to_csv(filename, index=False, encoding='utf-8')
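Since F5 is still a nested object, pd.json_normalize (the same function used in the question) can flatten it into separate columns. A minimal sketch, assuming records shaped like the sample above:

import pandas as pd

# Two sample records shaped like the question's data; F5 is a nested dict
res = [
    {"F1": "V1_1", "F2": False, "F3": 120, "F4": "URL1", "F5": {"F5_1": 4, "F5_2": "A"}},
    {"F1": "V2_1", "F2": True, "F3": 450, "F4": "URL2", "F5": {"F5_1": 13, "F5_2": "B"}},
]

# json_normalize flattens nested dicts into dotted column names:
# F1, F2, F3, F4, F5.F5_1, F5.F5_2
df = pd.json_normalize(res)
print(df.columns.tolist())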

Here is a variation without a DataFrame:

import requests
import json

url = '...'
filename = '...'

def decode_record(r):
    pl: str = r.decode('utf-8')
    # Strip the array markers, drop the trailing comma after each
    # record, and collapse the doubled commas
    pl = pl.replace('[{', '{').replace('}},', '}}').replace(',,', ',').replace('}}]', '}}')
    # The rest of cleanup goes here
    return json.loads(pl)


def encode_csv_record(jso):
    # Naive CSV encoding: stringify each value and join with commas.
    # Note: values are not quoted, and the nested F5 object is written
    # as its Python repr
    res = []
    for k, v in jso.items():
        res.append(str(v))
    return ','.join(res)


def run():
    r = requests.get(url, stream=True)
    with open(filename, 'w') as csvout:
        for line in r.iter_lines():
            # filter out keep-alive new lines
            if line:
                jso = decode_record(line)
                csv_line = encode_csv_record(jso)
                csvout.write(csv_line + '\n')  # one CSV row per record
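If field values can themselves contain commas or quotes, the standard-library csv module handles quoting for you. A minimal sketch of a replacement writer (write_records is a hypothetical helper; it assumes every record has the same keys in the same order):

import csv

def write_records(records, filename):
    # csv.DictWriter takes care of quoting and line endings;
    # assumes all records share the same keys in the same order
    with open(filename, 'w', newline='') as csvout:
        writer = None
        for jso in records:
            if writer is None:
                writer = csv.DictWriter(csvout, fieldnames=list(jso.keys()))
                writer.writeheader()
            writer.writerow(jso)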

Of course this answer has some gaps, but it should convey the idea.

This may also get a memory error, but it is a simplified approach.

import requests

# Get JSON data
re = requests.get(url)

# Write the response text straight to the file
with open(fileName, "w") as f:
    f.write(re.text)
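To avoid holding the whole response body in memory even for this write-through approach, requests can stream the body to disk in chunks. A sketch, assuming the same url and fileName:

import requests

# Stream the body to disk in fixed-size chunks instead of
# materializing it as one string via re.text
with requests.get(url, stream=True) as r:
    with open(fileName, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)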

Using the yield keyword instead of a return statement would allow you to return units of work one at a time.
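For example, a generator (parse_records below is a hypothetical name) can yield one decoded record at a time, building on decode_record from the earlier answer, so the caller never holds the full list in memory:

import requests

def parse_records(url):
    # Yield one decoded record per line instead of
    # returning the whole list at once
    r = requests.get(url, stream=True)
    for line in r.iter_lines():
        if line:
            yield decode_record(line)  # decode_record as defined above

# Consume one record at a time
for record in parse_records(url):
    print(record)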
