
Using Dask to parallelize read JSON -> save Parquet

I'd like to use Dask to ingest a large (>2 GB, >1M lines) line-delimited JSON file and save it as a batch of Parquet files. I'm running these experiments on my personal computer, and the file is larger than the available RAM, so attempting to load the entire JSON file into memory results in a memory error.

With Pandas, I can use read_json() to create a JsonReader object and then iterate over chunks in a for loop:

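import pandas as pd

# chunksize makes read_json return an iterator of DataFrames with `rows` rows each
reader = pd.read_json(file, orient='records', lines=True, chunksize=rows)
for i, chunk in enumerate(reader, start=1):
    chunk.to_parquet(f'part{i:02d}.parquet')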

This works as expected and produces one Parquet file per chunk.

I know that Dask has somewhat similar parameters for read_json (blocksize is given in bytes instead of rows), but I cannot get parallelization to work properly. Based on my understanding of the Dask examples, I've written this code:

import dask
import dask.dataframe as ddf
import dask.delayed as dd

@dask.delayed
def save(chunk, dest_dir):
    chunk.to_parquet(dest_dir, name_function=lambda i: f'part{i:02d}.parquet')

def f(reader, dest_dir):
    for chunk in reader:
        save(chunk, dest_dir)

reader = ddf.read_json(file, orient='records', lines=True, blocksize=block_bytes)
dask.compute(f(reader, dest_dir))

However, it seems that no chunks get processed and no parquet files are produced.

For reference, the following fails with a memory error:

import dask.dataframe as ddf
ddf.read_json(file, orient='records', lines=True, blocksize=block_bytes).to_parquet(dest_dir, name_function=lambda i: f'part{i:02d}.parquet')

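Calling save() only builds a lazy Delayed object; nothing runs until it is passed to dask.compute. Your f() creates those objects but discards them and returns None, so there is nothing to compute. Collect the delayed save operations and compute all of them: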

def f(reader, dest_dir):
    return [save(chunk, dest_dir) for chunk in reader]

delayeds = f(reader, dest_dir)
dask.compute(*delayeds)
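To make the laziness concrete, here is a minimal sketch that needs no data file. The names record and calls are made up for illustration; only dask.delayed and dask.compute come from Dask itself.

```python
import dask

calls = []  # records which tasks have actually executed


@dask.delayed
def record(i):
    # The body runs only when the task is computed, not when record(i) is called.
    calls.append(i)
    return i * 2


# Calling the decorated function returns Delayed objects; no work happens yet.
tasks = [record(i) for i in range(3)]
ran_before_compute = len(calls)
print(ran_before_compute)  # 0

# Passing the collected tasks to dask.compute triggers execution.
results = dask.compute(*tasks)
print(results)  # (0, 2, 4)
```

If the list of tasks is thrown away, as in the loop from the question, the function bodies never run.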

See this question: Compute list of dask delayed object
