Opening a large JSON file in Python

Question

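I have a 1.7 GB JSON file. When I try to open it with json.load(), I get a memory error. How can I read this JSON file in Python?

My JSON file is a large array of objects containing specific keys.

Edit: well, if the file is just one big array of objects and the structure of the objects is known beforehand, there is no need for extra tools; we can read it line by line. Each line contains exactly one element of the array. I noticed that this is how my JSON file is stored, so for me it works as simply as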

>>> with open('file.json', 'r') as f:
...     for line in f:
...         do_something_with(line)
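
The line-by-line idea from the edit can be sketched with only the standard library. This is a minimal sketch under an assumption the question does not fully spell out: the opening `[` and closing `]` sit on their own lines, and every other line holds exactly one complete element, optionally followed by a comma. The function name `iter_array_lines` and the sample data are made up for illustration.

```python
import json

def iter_array_lines(lines):
    """Yield one decoded element per line of a JSON array.

    Assumes '[' and ']' are on their own lines and each remaining line
    holds one complete element, optionally followed by a comma.
    """
    for line in lines:
        text = line.strip()
        if text in ("", "[", "]"):
            continue  # skip blank lines and the array brackets
        if text.endswith(","):
            text = text[:-1]  # drop the separator after the element
        yield json.loads(text)

# A file object iterates lazily over its lines, so only one line is held
# in memory at a time. A list of strings stands in for the file here.
sample = ['[\n', '{"id": 1, "name": "a"},\n', '{"id": 2, "name": "b"}\n', ']\n']
ids = [obj["id"] for obj in iter_array_lines(sample)]
print(ids)
```

With a real file, the same generator can be fed an open file object, for example `with open('file.json') as f:` followed by `for obj in iter_array_lines(f):`. If the file is not laid out one element per line, json.loads will raise on the first incomplete line, which is a useful signal that an incremental parser is needed instead.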

Answer 1

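You want an incremental JSON parser such as yajl, together with one of its Python bindings. An incremental parser reads as little as possible from the input and invokes a callback whenever something meaningful has been decoded. For example, to pull only the numbers out of a big JSON file: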

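from yajl import YajlContentHandler, YajlParser  # assumes the yajl-py bindings

list_of_numbers = []

class ContentHandler(YajlContentHandler):
    # Called each time the parser decodes a number
    def yajl_number(self, ctx, val):
        list_of_numbers.append(float(val))

parser = YajlParser(ContentHandler())
parser.parse(some_file)  # some_file is assumed to be an open file object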

For more information, see http://pykler.github.com/yajl-py/ .

Answer 2

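I found another Python wrapper around the yajl library: ijson.

It works better for me than yajl-py for the following reasons:

yajl-py did not detect the yajl library on my system; I had to hack the code to make it work
the ijson code is more compact and easier to use
ijson works with both yajl v1 and yajl v2, and it even has a pure-Python replacement for yajl
ijson has a very nice ObjectBuilder, which helps extract not just events but meaningful sub-objects from the parsed stream, at the level you specify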

Answer 3

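I found yajl (and hence ijson) to be much slower than the json module when a large data file is accessed from local disk. Here is a module that claims to perform better than yajl/ijson (while still being slower than json) when used with Cython: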

http://pietrobattiston.it/jsaone

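As the author points out, performance may be better than json when the file is received over the network, since an incremental parser can start parsing sooner.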

Answer 4

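For simple uses (i.e. iterating over the items in a top-level array), json-stream-parser looks good (I haven't used it). It appears to be a standalone JSON parser implemented from scratch in 234 lines of pure Python.

It doesn't require the JSON to be stored as "one object per line" or anything like that. The JSON can be all on one line, or it can contain newlines; it doesn't matter.

Usage: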

import sys
from json_stream_parser import load_iter
for obj in load_iter(sys.stdin):
    print(obj)
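
To show why the line layout does not matter to a streaming parser, here is a standard-library sketch of the same idea. It is not json-stream-parser's implementation; it swaps in `json.JSONDecoder.raw_decode`, reading the input in fixed-size chunks and decoding one array element at a time. The function name `iter_top_level_array`, the `chunk_size` parameter and the sample data are made up for illustration.

```python
import io
import json

_WS = " \t\r\n"  # JSON whitespace characters

def iter_top_level_array(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array read from text file `fp`.

    Only the unread part of the current chunk plus one element is buffered.
    """
    decoder = json.JSONDecoder()
    buf, pos = "", 0  # buffered text and index of the next unread character

    def read_more():
        # Drop consumed text and append the next chunk; False at end of input.
        nonlocal buf, pos
        chunk = fp.read(chunk_size)
        buf, pos = buf[pos:] + chunk, 0
        return bool(chunk)

    def skip_whitespace():
        # Move pos to the next non-whitespace character; False at end of input.
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in _WS:
                pos += 1
            if pos < len(buf):
                return True
            if not read_more():
                return False

    if not skip_whitespace() or buf[pos] != "[":
        raise ValueError("expected a top-level JSON array")
    pos += 1
    first = True
    while True:
        if not skip_whitespace():
            raise ValueError("input ended before the closing ']'")
        if buf[pos] == "]":
            return
        if not first:
            if buf[pos] != ",":
                raise ValueError("expected ',' or ']' after an element")
            pos += 1
            if not skip_whitespace():
                raise ValueError("input ended after ','")
        first = False
        while True:
            try:
                obj, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                end = None  # element is probably cut off at the chunk boundary
            if end is not None:
                # Accept only if a ',' or ']' follows, so a number split across
                # two chunks (e.g. "12" + "34") is not yielded half-read.
                tail = end
                while tail < len(buf) and buf[tail] in _WS:
                    tail += 1
                if tail < len(buf) and buf[tail] in ",]":
                    break
            if not read_more():
                raise ValueError("truncated or malformed array element")
        pos = end
        yield obj

# Tiny chunks force elements to straddle chunk boundaries.
sample = io.StringIO('[{"a": 1}, 2.5,\n "x,]y", [1, 2], null]')
items = list(iter_top_level_array(sample, chunk_size=3))
print(items)
```

One trade-off of this sketch: a genuinely malformed element is only reported after the rest of the input has been read into the buffer, because the code cannot tell a syntax error from an element that is merely cut off. A dedicated streaming parser avoids that by tracking parser state across chunks.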

Answer 5

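I have used Dask for large telemetry JSON-Lines files (newline delimited)...
The nice thing about Dask is that it does a lot of the work for you.
With it, you can read the data, process it, and write it to disk without reading it all into memory.
Dask will also parallelize the work for you and use multiple cores (threads)...

More information on Dask bags is available here:
https://examples.dask.org/bag.html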

import ujson as json #ujson for speed and handling NaNs which are not covered by JSON spec
import dask.bag as db

def update_dict(d):
    d.update({'new_key':'new_value', 'a':1, 'b':2, 'c':0})
    d['c'] = d['a'] + d['b']
    return d

def read_jsonl(filepaths):
    """Reads a JSON-L file with a Dask Bag

    :param filepaths: list of filepath strings OR a string with wildcard
    :returns: a dask bag of dictionaries, each dict a JSON object
    """
    return db.read_text(filepaths).map(json.loads)



filepaths = ['file1.jsonl.gz','file2.jsonl.gz']
#OR
filepaths = 'file*.jsonl.gz' #wildcard to match multiple files

#(optional) if you want Dask to use multiple processes instead of threads
# from dask.distributed import Client, progress
# client = Client(threads_per_worker=1, n_workers=6) #6 workers for 6 cores
# print(client)

#define bag containing our data with the JSON parser
dask_bag = read_jsonl(filepaths)

#modify our data
#note, this doesn't execute, it just adds it to a queue of tasks
#bags are immutable, so keep the new bag that map() returns
dask_bag = dask_bag.map(update_dict)

#(optional) if you're only reading one huge file but want to split the data into multiple files you can use repartition on the bag
# dask_bag = dask_bag.repartition(10)

#write our modified data back to disk, this is when Dask actually performs execution
dask_bag.map(json.dumps).to_textfiles('file_mod*.jsonl.gz') #dask will automatically apply compression if you use .gz

Answer 1
12 2012-05-23 07:53:04

Answer 2
4 2015-04-17 22:25:23

Answer 3
0 2015-08-12 21:15:09

Answer 4
0 2020-05-06 14:11:27

Answer 5
0 2020-10-28 23:27:21