I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df, output_path)
I have only one physical file as input, i.e. file.parquet.
The output of this script is also only one file, i.e. part.0.parquet.
Based on the partition_size and chunksize parameters, I would expect multiple files in the output.
Any help would be appreciated.
df.repartition(partition_size="100MB")
returns a new Dask DataFrame.
You have to write:
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions.
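For instance, after the reassignment above (the exact count depends on the size of your data):

print(df.npartitions)  # should be greater than 1 if the input exceeds 100MB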
Also, you can use the following to write your Parquet files:
df.to_parquet(output_path)
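Putting it together, a minimal corrected version of your script could look like this (dataset_path and output_path are assumed to be defined as in your question):

import dask.dataframe as dd

df = dd.read_parquet(dataset_path, chunksize="100MB")
df = df.repartition(partition_size="100MB")  # reassign: repartition returns a new DataFrame
print(df.npartitions)  # sanity check on the number of partitions
df.to_parquet(output_path)  # writes one part.N.parquet file per partition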
Because the Parquet format is designed for large datasets, you should also consider passing the compression= argument
when writing your Parquet files.
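For example (snappy is one common codec, supported by both the pyarrow and fastparquet engines):

df.to_parquet(output_path, compression="snappy")  # compress each output file with snappy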
You should get what you expect.
NB: Writing import dask.dataframe as pd
is misleading, because import dask.dataframe as dd
is the commonly used convention.