
Specifying dtype for parquet partition fields with dask.dataframe.read_parquet

I have a parquet dataset structured like:

/path/to/dataset/a=True/b=1/data.parquet
/path/to/dataset/a=False/b=1/data.parquet
/path/to/dataset/a=True/b=2/data.parquet
/path/to/dataset/a=False/b=2/data.parquet
...

How do I specify the dtypes of the partition fields (here, a and b) when calling dd.read_parquet on a directory like this?

I am using the pyarrow engine. Do I need to pass a kwarg through to a pyarrow function? If so, what would it be?

Or can I just call astype(dict(a="bool", b="int")) or something similar after reading?

Later in my code I call DataFrame.query to filter values, so the dtypes matter (for boolean columns, for example).

If you are using pyarrow as the underlying engine, you can pass a partitioning argument that specifies the schema of the partition fields.

dask.dataframe.read_parquet forwards any extra keyword arguments to the engine, so it will pass partitioning along if you provide it. See **kwargs in the dd.read_parquet documentation.

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.dataset as ds

# Describe the hive-style partition fields and their types
partitioning = ds.partitioning(
    pa.schema([pa.field("a", pa.bool_()), pa.field("b", pa.int32())]),
    flavor="hive",
)
ddf = dd.read_parquet("/path/to/dataset/", engine="pyarrow", partitioning=partitioning)
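
As a quick sanity check (a minimal sketch assuming the dataset layout above, with ddf being the result of the read_parquet call), the partition columns should come back with the requested dtypes, so a later DataFrame.query on the boolean column behaves as intended:

print(ddf.dtypes)                     # a should be bool, b should be int32
filtered = ddf.query("a and b == 1")  # boolean filtering on the partition columns
print(filtered.compute())

This avoids a separate astype pass after reading, since the types are fixed at load time.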
