简体   繁体   中英

How do I filter dask.dataframe.read_parquet with timestamp?

I am trying to read some parquet files using dask.dataframe.read_parquet method. In the data I have a column named timestamp , which contains data such as:

0     2018-12-20 19:00:00
1     2018-12-20 20:00:00
2     2018-12-20 21:00:00
3     2018-12-20 22:00:00
4     2018-12-20 23:00:00
5     2018-12-21 00:00:00
6     2018-12-21 01:00:00
7     2018-12-21 02:00:00
8     2018-12-21 03:00:00
9     2018-12-21 04:00:00
10    2018-12-21 05:00:00
11    2018-12-21 06:00:00
12    2018-12-21 07:00:00
13    2018-12-21 08:00:00
14    2018-12-21 09:00:00
15    2018-12-21 10:00:00
16    2018-12-21 11:00:00
17    2018-12-21 12:00:00
18    2018-12-21 13:00:00
19    2018-12-21 14:00:00
20    2018-12-21 15:00:00

and I would like to filter based on timestamp and return say, data within the last 10 days. How do I do this?

I tried something like:

filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='pyarrow', filters=filters)

But I am getting the error:

TypeError: Cannot compare type 'Timestamp' with type 'bytes_'

It turned out that the problem was from the data source I was working with. I tested a different data source originally written with dask and it worked simply as:

filter_timestamp_days = pd.Timestamp(datetime.today() - timedelta(days=days))
filters = [('timestamp', '>', filter_timestamp_days)]
df = dask_df.read_parquet(DATA_DIR, engine='fastparquet', filters=filters)

I did not need to convert filter_timestamp_days any further. The former data source was written with a Scala client and it seems somehow the metadata is not readable in dask.

Thank you all for your contributions.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM