I have a parquet dataset structured like:
/path/to/dataset/a=True/b=1/data.parquet
/path/to/dataset/a=False/b=1/data.parquet
/path/to/dataset/a=True/b=2/data.parquet
/path/to/dataset/a=False/b=2/data.parquet
...
How do I specify the dtypes of partition fields (here, a and b) when calling dd.read_parquet on a directory like this?
I am using the pyarrow engine. Do I need to specify a kwarg for a pyarrow function? If so, what would it be?
Or can I just call astype(dict(a="bool", b="int")) or something similar?
Later on in my code I call DataFrame.query to filter rows, so correct dtypes matter, for boolean columns in particular.
If you are using pyarrow as the underlying engine, you can pass a partitioning argument to specify the schema of the partition columns. dask.dataframe.read_parquet will forward that argument to pyarrow if you provide it; see **kwargs in the docs.
import dask.dataframe as dd
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the hive-style partition columns and their types.
partitioning = ds.partitioning(
    pa.schema([pa.field("a", pa.bool_()), pa.field("b", pa.int32())]),
    flavor="hive",
)

df = dd.read_parquet("/path/to/dataset/", engine="pyarrow", partitioning=partitioning)
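As for the astype fallback the question mentions: it can work for the integer column, but be careful with booleans. If no partitioning schema is given, hive partition values typically come back as strings (or categoricals of strings), and astype("bool") on a non-empty string such as "False" yields True. A minimal pandas sketch of the pitfall and a safe explicit mapping (the frame here is fabricated to mimic partition columns read back as strings):

```python
import pandas as pd

# Hypothetical frame standing in for a dataset whose partition columns
# were read back as plain strings.
df = pd.DataFrame({"a": ["True", "False"], "b": ["1", "2"], "x": [1.0, 2.0]})

# Pitfall: astype(bool) treats any non-empty string as True,
# so "False" would silently become True.
naive = df["a"].astype(bool)
assert naive.all()  # both rows are True, which is wrong

# Safe alternative: map the partition strings explicitly.
df["a"] = df["a"].map({"True": True, "False": False})
df["b"] = df["b"].astype("int64")

print(df.dtypes["a"], df.dtypes["b"])  # bool int64
```

With correct dtypes in place, a later df.query("a") filters on the boolean column as expected.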