
Losing index information when using dask.dataframe.to_parquet() with partitioning

When I was using dask=1.2.2 with pyarrow=0.11.1, I did not observe this behavior. After updating (to dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here is a minimal example which shows the issue:

from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

REPORT_DATE_TEST = pd.to_datetime('2019-01-01').date()
path = Path('/home/ludwik/Documents/YieldPlanet/research/trials/')

observations_nr = 3
dtas = range(0, observations_nr)
rds = [REPORT_DATE_TEST - timedelta(days=days) for days in dtas]
data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [REPORT_DATE_TEST] * observations_nr,
    }) \
    .set_index('dta')

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

file_name = 'trial.parquet'
data_to_export_dask.to_parquet(path / file_name,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               write_index=True
                              )

data_read = dd.read_parquet(path / file_name, engine='pyarrow')
print(data_read)

Which gives:

    stay_date  dta report_date
0  2019-01-01    2  2018-12-30
0  2019-01-01    1  2018-12-31
0  2019-01-01    0  2019-01-01
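
A quick check (just an illustrative sketch, reusing the objects from the snippet above) makes the loss explicit: the original 'dta' index does not come back after the round trip.

# The original frame is indexed by 'dta' = [0, 1, 2]; the frame read back from
# the partitioned dataset only has a default integer index (the zeros above).
print(data_to_export.index)
print(data_read.compute().index)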

I did not see this behavior described anywhere in the dask documentation.

Does anyone know how to save the index while partitioning the parquet data?

The issue was in the pyarrow backend. I filed a bug report on their JIRA page: https://issues.apache.org/jira/browse/ARROW-7782

It might seem like an attempt to sidestep the question, but my suggestion would be to partition along the index instead. This would also ensure non-overlapping indexes across the partitions.

This would look like dd.from_pandas(data_to_export, npartitions=3), and then skipping partition_on and write_index in to_parquet (see the sketch below). The index would have to be sorted.

This preserves the index and sets the divisions correctly.

Note that you are not guaranteed to get exactly the number of partitions you request with npartitions, especially not with small data sets.
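
A minimal sketch of that approach, building on the imports and the data_to_export / path objects from the question (the 'trial_by_index.parquet' file name is just for illustration):

# Split along the (sorted) index instead of using partition_on; write_index
# defaults to True, so the 'dta' index is written with each partition.
data_by_index = dd.from_pandas(data_to_export.sort_index(), npartitions=3)
print(data_by_index.divisions)   # non-overlapping divisions taken from the sorted 'dta' index

data_by_index.to_parquet(path / 'trial_by_index.parquet',
                         engine='pyarrow',
                         compression='snappy')

data_read = dd.read_parquet(path / 'trial_by_index.parquet', engine='pyarrow')
print(data_read.compute())       # 'dta' comes back as the index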
