
PyArrow / Dask to_parquet: partitions with all-null columns

When writing Dask dataframe partitions to Parquet, I've noticed that read_parquet fails with conflicting metadata / schemas. This is because in some of the partitions certain columns are entirely null / np.nan, while in others they are filled with values.

Beforehand, I cast the data types of my partitions:

# dtypes maps column names to pandas dtypes, e.g. {"id": "int64", "label": "object"}
df = df.astype(dtypes)

PyArrow fails to read my partitioned Parquet files because columns containing only nulls are assigned the data type 'null'. How do I tackle this issue? Some partitions have columns that are entirely null, while in other partitions the same columns contain values.

The columns' data types are integer, float, or string (object).
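
To make the failure mode concrete, here is a minimal reproduction sketch (the column names and output path are illustrative, not from the original post): a two-partition Dask dataframe in which an object column is entirely null in the first partition.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "id": range(10),
    "label": [None] * 5 + list("abcde"),  # first half entirely null
})

# Two partitions: "label" is all-null in the first one, so pyarrow
# may infer its type as null there and as string in the second.
ddf = dd.from_pandas(pdf, npartitions=2)
ddf.to_parquet("data/out", engine="pyarrow")

# Reading the dataset back can then fail with a metadata/schema mismatch.
ddf2 = dd.read_parquet("data/out", engine="pyarrow")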

I suggest raising an issue on the Dask or Arrow issue tracker.
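
As a possible workaround (a sketch, not part of the answer above): Dask's to_parquet accepts a schema argument when using the pyarrow engine, so declaring the intended types up front should keep all-null partitions from being inferred as type null. The column names below are illustrative and continue the reproduction sketch from the question.

import pyarrow as pa

# Fix every column's type explicitly so that a partition consisting
# only of nulls cannot be inferred as pa.null().
schema = pa.schema([
    ("id", pa.int64()),
    ("label", pa.string()),
])

ddf.to_parquet("data/out", engine="pyarrow", schema=schema)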
