
Read a large JSON file in Python and take a slice as a sample

I'm dealing with a really large JSON file (6.5 GB); on my local machine it's impossible to read it all at once. So I want to read a chunk as a test sample and write code against this sample before running it on the entire dataset.

import pandas as pd

file_dir = 'D://yelp_dataset/yelp_academic_dataset_review.json'

# with chunksize set, read_json returns a JsonReader instead of a DataFrame
df_review_sample = pd.read_json(file_dir, lines=True, chunksize=1000)

With the attempt above, df_review_sample becomes a JsonReader object, not a DataFrame. Is there a way to show the first chunk as a DataFrame?

I ran into the same issue yesterday afternoon, and I finally understood what's going on.

Using the arguments lines=True and chunksize=X creates a JsonReader that yields X lines at a time.
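If you only need the first chunk as a DataFrame, you don't even need a loop: the JsonReader is an iterator, so you can pull a single chunk with next(). Here is a minimal sketch, assuming the file path from the question:

import pandas as pd

reader = pd.read_json('D://yelp_dataset/yelp_academic_dataset_review.json',
                      lines=True, chunksize=1000)
first_chunk = next(reader)   # a DataFrame holding the first 1000 lines
print(first_chunk.head())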

To display every chunk, you then loop over the reader.

Here is a piece of code to make it clear:

import pandas as pd

# read_json with chunksize returns a JsonReader that yields DataFrames
chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
for chunk in chunks:
    print(chunk)   # each chunk is a DataFrame of up to 10000 rows
    break          # stop after the first chunk

The reader yields a number of chunks according to the length of your JSON file, counted in lines. For example, with a 100,000-line JSON file and chunksize=10000, I get 10 chunks.
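You can verify the chunk count by exhausting the reader; a minimal sketch, assuming a hypothetical 100,000-line file at ../input/data.json:

import pandas as pd

chunks = pd.read_json('../input/data.json', lines=True, chunksize=10000)
n_chunks = sum(1 for _ in chunks)   # consumes the reader
print(n_chunks)   # 10 for a 100,000-line file

Note that the last chunk may hold fewer rows when the line count is not an exact multiple of chunksize.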

In the code above I added a break so that only the first chunk is printed; if you remove it, you will get all 10 chunks one by one.
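Since the goal is to develop against a small sample, one option is to write the first chunk back out as a smaller line-delimited JSON file and work from that. A sketch, assuming the path from the question (the output filename is hypothetical):

import pandas as pd

reader = pd.read_json('D://yelp_dataset/yelp_academic_dataset_review.json',
                      lines=True, chunksize=1000)
sample = next(reader)   # first 1000 reviews as a DataFrame

# hypothetical output path; lines=True keeps the same line-delimited format
sample.to_json('review_sample.json', orient='records', lines=True)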
