How do I read a large JSON file in pandas?

Question

My code is: data_review = pd.read_json('review.json'). My review data looks like this:

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

But I get the following error:

    333             fh, handles = _get_handle(filepath_or_buffer, 'r',
    334                                       encoding=encoding)
--> 335             json = fh.read()
    336             fh.close()
    337         else:

OSError: [Errno 22] Invalid argument

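My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here (link) to practice.

When I use the following code, it throws the same error: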

import json
with open('review.json') as json_file:
    data = json.load(json_file)

Answer 1

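Perhaps the file you are reading contains multiple JSON objects rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Both calls expect the file to hold exactly one JSON document.

From what I have seen of the Yelp dataset, your file must contain something like this: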

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0}
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0}
....    
....

and so on.

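So it is important to realize that this is not a single JSON document but multiple JSON objects in one file.

To read this data into a pandas DataFrame, the following solution should work: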
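To make the difference concrete, here is a minimal sketch using only the standard library. The two records are made-up stand-ins for lines of the review file, not real data. It shows that parsing the whole text as one document fails, while parsing each line separately succeeds.

```python
import json

# Two JSON objects, one per line, in the same layout as the review file.
# The values are made up for illustration.
text = '{"review_id": "a", "stars": 5}\n{"review_id": "b", "stars": 3}\n'

# Parsing the whole text as a single document fails at the second object.
try:
    json.loads(text)
    whole_file_ok = True
except json.JSONDecodeError as err:
    whole_file_ok = False
    print("whole-file parse failed:", err.msg)

# Parsing line by line works, because each line is a complete JSON document.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
print(records)
```

This format is often called JSON Lines or newline-delimited JSON.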

import json
import pandas as pd

with open('review.json') as json_file:
    # Parse each line as its own JSON object. For 4-5 million rows this
    # may take at least 8-10 minutes.
    data = [json.loads(line) for line in json_file]

df = pd.DataFrame(data)

Given how large the data is, I expect your machine will take a considerable amount of time to load it into a DataFrame.

Answer 2

If you don't want to use a for loop, the following should do the trick:

import pandas as pd

df = pd.read_json("foo.json", lines=True)

This handles the case where your JSON file looks like this:

{"foo": "bar"}
{"foo": "baz"}
{"foo": "qux"}

and converts it into a DataFrame with a single column, foo, and three rows.

You can read more in the pandas documentation.
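As a quick check of that claim, the sketch below feeds the same three sample lines to pd.read_json through an in-memory buffer. Using io.StringIO instead of a file on disk is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# The three sample lines from the answer, held in memory instead of foo.json.
raw = '{"foo": "bar"}\n{"foo": "baz"}\n{"foo": "qux"}\n'

# lines=True tells pandas to parse each line as a separate record.
df = pd.read_json(io.StringIO(raw), lines=True)

print(df.shape)
print(df["foo"].tolist())
```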

Answer 3

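Passing lines=True together with chunksize=X creates a reader that reads a fixed number of lines at a time.

You then have to loop over the reader to get each chunk.

Here is a snippet to illustrate: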

import pandas as pd

# With chunksize set, read_json returns an iterator of DataFrames.
chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
for chunk in chunks:
    print(chunk)
    break

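The reader splits your JSON into chunks by line count. For example, with a JSON file of 100,000 lines (one object per line) and chunksize=10000, I get 10 chunks.

In the code above I added a break so that only the first chunk is printed; if you remove it, you will get all 10 chunks one after another.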
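The chunk arithmetic can be checked on a small scale. The sketch below is only an illustration: it writes 25 made-up records to a temporary file and reads them back with chunksize=10, so the last chunk holds the remainder.

```python
import json
import os
import tempfile

import pandas as pd

# Write 25 small records, one JSON object per line, to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for i in range(25):
        tmp.write(json.dumps({"row": i}) + "\n")
    path = tmp.name

# Iterate over the reader and record how many rows each chunk holds.
chunk_sizes = []
for chunk in pd.read_json(path, lines=True, chunksize=10):
    chunk_sizes.append(len(chunk))

os.remove(path)
print(chunk_sizes)
```

Only one chunk is held as a DataFrame at a time inside the loop, which is what keeps memory use bounded.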

Answer 4

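Improving on Max's answer, this loads a large JSON file into a DataFrame without running into a memory error:

You can use the following code. It reads the file in chunks and combines them, so the final DataFrame still has to fit in memory.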

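import pandas as pd

# Read the file 10,000 lines at a time instead of all at once.
chunks = pd.read_json('/content/gdrive/My Drive/yelp/yelp_academic_dataset_review.json', lines=True, chunksize=10000)
# Concatenate once at the end; calling pd.concat inside the loop
# recopies the accumulated rows on every iteration.
reviews = pd.concat(chunks, ignore_index=True)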
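If the combined DataFrame is still too large, one option is to shrink each chunk before combining. This is a sketch under assumptions, not part of the original answer: the choice of columns and the 4-star threshold are hypothetical, and the data is a tiny made-up stand-in for the review file.

```python
import io

import pandas as pd

# Tiny stand-in for the review file; the values are made up.
raw = "\n".join(
    '{"review_id": "r%d", "stars": %d, "text": "t%d"}' % (i, i % 5 + 1, i)
    for i in range(10)
) + "\n"

kept = []
for chunk in pd.read_json(io.StringIO(raw), lines=True, chunksize=5):
    # Keep only two columns and only rows with 4 or more stars
    # (a hypothetical filter chosen for illustration).
    kept.append(chunk.loc[chunk["stars"] >= 4, ["review_id", "stars"]])

# Combine the reduced chunks in a single concat call.
reviews = pd.concat(kept, ignore_index=True)
print(reviews)
```

Dropping the long text column early is where most of the saving would come from on a dataset like this one.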

Answer 5

If your JSON file contains multiple objects rather than a single one, do the following:

import json

data = []
# Use a with block so the file is closed after reading.
with open('sample.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

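Note the difference between json.load and json.loads.

json.loads() expects a (valid) JSON string, e.g. {"foo": "bar"}, whereas json.load() reads from a file object. So if your JSON file looks like what @Mant1c0r3 described, json.loads applied to each line is appropriate.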
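The distinction can be seen side by side in a minimal sketch. Here io.StringIO stands in for an open file, which is an assumption made only to avoid touching the disk.

```python
import io
import json

payload = '{"foo": "bar"}'

# json.loads parses a string that is already in memory.
from_string = json.loads(payload)

# json.load reads from a file-like object and parses its entire content.
from_file = json.load(io.StringIO(payload))

print(from_string == from_file)
```

Because json.load consumes the whole file as one document, it cannot be used directly on a file that holds one object per line.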


5 answers

Answer 1
11 Accepted 2017-11-22 00:15:45

Answer 2
11 2018-10-06 02:35:36

Answer 3
1 2021-04-16 13:28:29

Answer 4
1 2022-10-02 21:09:35

Answer 5
0 2021-08-08 17:29:26
