
Split a parquet file into smaller chunks using Dask

I am trying to split a parquet file using Dask with the following piece of code:

import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)

I have only one physical input file, i.e. file.parquet.

The output of this script is likewise a single file, i.e. part.0.parquet.

Based on the partition_size and chunksize parameters, I would expect multiple output files.

Any help would be appreciated

df.repartition(partition_size="100MB") returns a new Dask DataFrame; it does not modify df in place.

You have to write:

df = df.repartition(partition_size="100MB")

You can check the number of partitions created by looking at df.npartitions.
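For instance, a quick check before writing anything (this does not trigger any computation, since npartitions is known from the task graph):

print(df.npartitions)  # each partition is written as its own part.N.parquet file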

Also, you can use the following to write your parquet files:

df.to_parquet(output_path)

Because Parquet is meant for large datasets, you should also consider using the compression= argument when writing your parquet files.
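For example (a hedged sketch; "snappy" is the usual default codec, while "gzip" gives smaller files at the cost of slower writes):

df.to_parquet(output_path, compression="gzip")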

You should get what you expect.

NB: writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the commonly used alias.
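Putting it together, a minimal end-to-end sketch (assuming dataset_path and output_path are defined as in the question):

import dask.dataframe as dd

# read the single input parquet file
df = dd.read_parquet(dataset_path)

# keep the repartitioned DataFrame -- repartition does not modify df in place
df = df.repartition(partition_size="100MB")
print(df.npartitions)  # number of part.N.parquet files that will be written

# write one file per partition; compression is optional
df.to_parquet(output_path, compression="snappy")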
