I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df, output_path)
I have only one physical file as input, i.e. file.parquet.
The output of this script is also only one file, i.e. part.0.parquet.
Based on the partition_size and chunksize parameters, I would expect multiple files in the output.
Any help would be appreciated.
df.repartition(partition_size="100MB")
returns a new Dask DataFrame.
You have to write:
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions.
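For instance, after the reassignment above (the exact count depends on the size of your data):

print(df.npartitions)  # should be greater than 1 if the input exceeds 100MB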
Also, you can use the following to write your Parquet files:
df.to_parquet(output_path)
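Putting it together, a minimal corrected version of your script could look like this (dataset_path and output_path are assumed to be defined as in your question):

import dask.dataframe as dd

df = dd.read_parquet(dataset_path, chunksize="100MB")
df = df.repartition(partition_size="100MB")  # reassign: repartition returns a new DataFrame
print(df.npartitions)  # sanity check on the number of partitions
df.to_parquet(output_path)  # writes one part.N.parquet file per partition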
Because the Parquet format is designed for large datasets, you should also consider passing the compression= argument
when writing your Parquet files.
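For example (snappy is one common codec, supported by both the pyarrow and fastparquet engines):

df.to_parquet(output_path, compression="snappy")  # compress each output file with snappy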
You should get what you expect.
NB: Writing import dask.dataframe as pd
is misleading, because import dask.dataframe as dd
is the commonly used convention.