Read Large Json in Python and take a slice as a sample
I'm dealing with a really large JSON file (6.5GB); with my local machine, it's impossible to read it all at once. So I want to read a chunk as a testing sample and write code based on this sample before running on the entire dataset.
import pandas as pd
file_dir = 'D://yelp_dataset/yelp_academic_dataset_review.json'
df_review_sample = pd.read_json(file_dir, lines=True, chunksize=1000)
After this attempt, df_review_sample is a JsonReader object, not a DataFrame. Is there a way to show the first chunk as a DataFrame?
I ran into the same issue yesterday afternoon, and I finally understood what's going on.
Using the arguments lines=True and chunksize=X creates a reader that yields a specific number of lines at a time.
You then have to loop over it to process each chunk.
Here is a piece of code to illustrate:
import pandas as pd
import json
chunks = pd.read_json('../input/data.json', lines=True, chunksize = 10000)
for chunk in chunks:
    print(chunk)
    break
The reader produces multiple chunks according to the length of your JSON file (counted in lines). For example, with a 100,000-line JSON file and chunksize = 10000, you will get 10 chunks.
In the code above I added a break so that only the first chunk is printed; if you remove it, you will get all 10 chunks one by one.
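If you only want the first chunk as a DataFrame without writing a loop, you can also call next() on the reader, since a JsonReader is an iterator. A minimal self-contained sketch, using a small hypothetical sample.json in place of the actual Yelp file:

```python
import pandas as pd

# Hypothetical small JSON Lines file standing in for the 6.5GB Yelp dataset.
with open('sample.json', 'w') as f:
    for i in range(25):
        f.write('{"review_id": %d, "stars": %d}\n' % (i, i % 5 + 1))

# With lines=True and chunksize, read_json returns a JsonReader, not a DataFrame.
reader = pd.read_json('sample.json', lines=True, chunksize=10)

# next() pulls the first chunk as a plain DataFrame -- no loop needed.
df_first = next(reader)
print(type(df_first))   # <class 'pandas.core.frame.DataFrame'>
print(len(df_first))    # 10
```

From here you can develop against df_first and only later switch back to looping over the full reader for the entire dataset.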