How to split a csv file into multiple files using Dask?
The code below seems to write to only one file, which takes a long time for the full dataset. I believe writing to multiple files would be faster.
import dask.dataframe as ddf
import dask
file_path = "file_name.csv"
df = ddf.read_csv(file_path)
futs = df.to_csv(r"*.csv", compute=False)
_, l = dask.compute(futs, df.size)
I suspect that when you read the file, df.npartitions is just 1. Dask writes one output file per partition, so a single partition yields a single file.
import dask.dataframe as dd
file_path = "file_name.csv"
df = dd.read_csv(file_path)
# set how many files you would like to have
# in this case 10
df = df.repartition(npartitions=10)
df.to_csv("file_*.csv")
But as far as I can see it's not faster.