
Specifying dtype for parquet partition fields with dask.dataframe.read_parquet

I have a parquet dataset structured like:

/path/to/dataset/a=True/b=1/data.parquet
/path/to/dataset/a=False/b=1/data.parquet
/path/to/dataset/a=True/b=2/data.parquet
/path/to/dataset/a=False/b=2/data.parquet
...

How do I specify the dtypes of the partition fields (here, a and b) when calling dd.read_parquet on a directory like this?

I am using the pyarrow engine. Do I need to pass a kwarg through to a pyarrow function? If so, what would it be?

Or can I just call astype(dict(a="bool", b="int")) or something similar after reading?

Later in my code I call DataFrame.query to filter values, so the dtypes matter (for boolean columns, for example).

If you are using pyarrow as the underlying engine, you can pass a partitioning argument that specifies the schema of the partition fields.

dask.dataframe.read_parquet forwards any extra keyword arguments to the engine, so it will pass partitioning along if you provide it. See **kwargs in the dd.read_parquet documentation.

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.dataset as ds

# Describe the hive-style partition fields and their types
partitioning = ds.partitioning(
    pa.schema([pa.field("a", pa.bool_()), pa.field("b", pa.int32())]),
    flavor="hive",
)
ddf = dd.read_parquet("/path/to/dataset/", engine="pyarrow", partitioning=partitioning)
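
As a quick sanity check (a minimal sketch assuming the dataset layout above, with ddf being the result of the read_parquet call), the partition columns should come back with the requested dtypes, so a later DataFrame.query on the boolean column behaves as intended:

print(ddf.dtypes)                     # a should be bool, b should be int32
filtered = ddf.query("a and b == 1")  # boolean filtering on the partition columns
print(filtered.compute())

This avoids a separate astype pass after reading, since the types are fixed at load time.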
