
Save multiple parquet files from dask dataframe

I would like to save multiple parquet files from a Dask dataframe, one parquet file for each unique value in a specific column. Hence, the number of parquet files should be equal to the number of unique values in that column.

For example, given the following dataframe, I want to save four parquet files, because there are four unique values in column "A".

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 3, 1, 3, 6, 6],
        "B": ["A", "L", "C", "D", "A", "B", "A", "B"],
        "C": [1, 2, 3, 4, 5, 6, 7, 8],
    }
)
ddf = dd.from_pandas(df, npartitions=2)

for i in ddf["A"].unique().compute():
    ddf.loc[ddf["A"] == i].to_parquet(f"file_{i}.parquet", schema="infer")

I am not sure whether looping over the Dask dataframe is the right approach to scale this up (the result of unique().compute() could be larger than my memory). Moreover, I am unsure whether I need to sort the data beforehand.

If you have any suggestions on how to implement this properly, or things to take into account, I would be happy to hear them!

This is not exactly what you are after, but it's possible to use the partition_on option of .to_parquet:

ddf.to_parquet("file_parquet", schema="infer", partition_on="A")

Note that this does not guarantee one file per unique value as you want; instead there will be subfolders inside file_parquet, each potentially containing more than one file.
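As a minimal sketch (assuming the file_parquet directory written above), the partitioned dataset can be read back with dd.read_parquet, optionally pushing a filter down on the partition column:

import dask.dataframe as dd

# Read the whole partitioned dataset back; "A" is restored as a column.
ddf_back = dd.read_parquet("file_parquet", engine="pyarrow")

# Read only the rows where A == 1 by filtering on the partition column.
ddf_a1 = dd.read_parquet("file_parquet", engine="pyarrow", filters=[("A", "==", 1)])
print(ddf_a1.compute())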

You can achieve this by setting the index to the column of interest and setting the divisions to follow the unique values in that column.

This should do the trick:

import dask.dataframe as dd
import pandas as pd
import numpy as np

# create dummy dataset with 3 partitions
df = pd.DataFrame(
    {
        "letter": ["a", "b", "c", "a", "a", "d", "d", "b", "c", "b", "a", "b", "c", "e", "e", "e"],
        "number": np.arange(0, 16),
    }
)

ddf = dd.from_pandas(df, npartitions=3)

# set index to column of interest
ddf = ddf.set_index('letter').persist()

# generate list of divisions (must be sorted; the last value needs to be repeated)
index_values = sorted(df.letter.unique())
divisions = index_values + [index_values[-1]]

# repartition 
ddf = ddf.repartition(divisions=divisions).persist()

# write out partitions as separate parquet files
for i in range(ddf.npartitions):
    ddf.partitions[i].to_parquet(f"file_{i}.parquet", engine='pyarrow')

Note the double occurrence of the value 'e' in the list of divisions. As per the Dask docs: "Divisions includes the minimum value of every partition's index and the maximum value of the last partition's index." This means the last value needs to be included twice, since it serves as both the start and the end of the last partition's index.
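A quick way to see this (a sketch, assuming the ddf built in the snippet above) is to print the resulting divisions and partition count:

# Each unique letter starts its own partition; 'e' appears twice,
# once as the start of the last partition and once as its upper bound.
print(ddf.divisions)    # ('a', 'b', 'c', 'd', 'e', 'e')
print(ddf.npartitions)  # 5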
