
Reading Parquet files in Dask returns empty dataframe

I'm trying to replicate this datashader+parquet+dask+dash example in order to do something similar with my own data. Here's the git repository.

The steps to replicate include running the Jupyter notebook to convert the 4 GB CSV file into a Parquet file. I can run this code without issue; it creates a parquet directory containing many ~70 MB files. But when I try to read the Parquet data back, I get an empty dataframe (with the correct columns). After reading the CSV into a Dask dataframe and doing some processing, I can check head():

ddf.head()
radio   mcc net area    cell    unit    lon lat range   samples changeable  created updated averageSignal   x_3857  y_3857
0   UMTS    262 2   801 86355   0   13.285512   52.522202   1000    7   1   1282569574000000000 1300155341000000000 0   1.478936e+06    6.895103e+06
1   GSM 262 2   801 1795    0   13.276907   52.525714   5716    9   1   1282569574000000000 1300155341000000000 0   1.477979e+06    6.895745e+06
2   GSM 262 2   801 1794    0   13.285064   52.524000   6280    13  1   1282569574000000000 1300796207000000000 0   1.478887e+06    6.895432e+06
3   UMTS    262 2   801 211250  0   13.285446   52.521744   1000    3   1   1282569574000000000 1299466955000000000 0   1.478929e+06    6.895019e+06
4   UMTS    262 2   801 86353   0   13.293457   52.521515   1000    2   1   1282569574000000000 1291380444000000000 0   1.479821e+06    6.894977e+06

write it to parquet:

import os

# Write parquet file to the ./data directory
os.makedirs('./data', exist_ok=True)
parquet_path = './data/cell_towers.parq'
ddf.to_parquet(parquet_path,
               compression='snappy',
               write_metadata_file=True)

and attempt to read from parquet:

ddy = dd.read_parquet('./data/cell_towers.parq')

but it returns an empty dataframe, albeit with the correct column names:

ddy.head(3)
> radio mcc net area    cell    unit    lon lat range   samples changeable  created updated averageSignal   x_3857  y_3857

len(ddy)
> 0 

This is the first time I'm using Dask dataframes and Parquet; it seems like it should just work, so there may be some basic concept I'm missing here.

Small replicable code snippet:

import pandas as pd
import dask.dataframe as dd

ddfx = dd.from_pandas(pd.DataFrame(range(10), columns=['A']), npartitions=2)
parquet_path = './dummy.parq'
ddfx.to_parquet(parquet_path,
                compression='snappy',
                write_metadata_file=True)
ddfy = dd.read_parquet('./dummy.parq')
print('Input DDF length: {0} . Output DDF length: {1}'.format(len(ddfx), len(ddfy)))

Input DDF length: 10 . Output DDF length: 0
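Since Parquet reading goes through whichever engine happens to be installed (pyarrow or fastparquet), a quick first check is to print the versions of the packages involved. A minimal diagnostic sketch using only the standard library; the package names listed are the usual distribution names, not something from the example above:

```python
from importlib.metadata import version, PackageNotFoundError

def report_versions(pkgs=("dask", "pandas", "pyarrow", "fastparquet")):
    """Return {package: installed version, or None if absent}."""
    out = {}
    for pkg in pkgs:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = None
    return out

print(report_versions())
```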

How can I write a DDF to parquet and then read it?

I am unable to reproduce the error using dask=2022.05.2. There might be some version incompatibility, so I'd recommend installing dask, pandas, and fastparquet in a dedicated environment.
