
Pandas between_time equivalent for Dask DataFrame

I have a Dask dataframe created with dd.read_csv("./*/file.csv"), where the * glob matches one folder per date. In the concatenated dataframe I want to select rows by time of day, as I would with df.between_time("09:30", "16:00") in pandas.

Because Dask's internal representation of the index does not have the nice features of pandas' DatetimeIndex, I haven't had any success filtering the way I normally would in pandas. Short of resorting to a naive mapping function or loop, I am unable to get this to work in Dask.

Since the partitions are by date, perhaps that could be exploited by converting each partition to a pandas dataframe and then back to a Dask partition, but it seems like there should be a better way.


Updating with the example used in Angus' answer.


I guess I don't understand the logic of the queries in the answers/comments. Is pandas smart enough to treat the strings in the boolean mask as times rather than as literal strings, and to do the correct datetime comparisons?

Filtering in Dask works just like in pandas, with a few convenience functions removed.

For example, if you had the following data:

time,A,B
6/18/2020 09:00,29,0.330799201
6/18/2020 10:15,30,0.518081116
6/18/2020 18:25,31,0.790506469

The following code:

import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
df.loc[(df.index > "09:30") & (df.index < "16:00")].compute()

If run on 18 June 2020, this would return:

time,A,B
2020-06-18 10:15:00,30,0.518081
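To see why only one date matches, it may help to check how pandas parses a time-only string on its own. The snippet below is a minimal sketch using plain pandas (no Dask involved); the variable names are illustrative only.

```python
import pandas as pd

# A time-only string is parsed into a full timestamp; pandas fills in
# the date on which the code runs.
ts = pd.Timestamp("09:30")
today = pd.Timestamp.now().date()

print(ts)                  # a full timestamp, not a bare time of day
print(ts.date() == today)  # the date part is the current date
```

So a comparison such as df.index > "09:30" is a full datetime comparison against 09:30 on the day the code runs, not a comparison of times of day.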

EDIT:

The answer above filters for the current date only, because pandas interprets a time-only string as a datetime value on the current date. If you'd like to filter values for all days between specific times, a workaround is to strip the date from the datetime column:

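import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time'])
# Keep only the time of day, re-parsed as a datetime so that all rows share one date
df["time_of_day"] = dd.to_datetime(df["time"].dt.time.astype(str))
# The comparison strings are parsed as datetimes as well
df.loc[(df.time_of_day > "09:30") & (df.time_of_day < "16:00")].compute()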

Bear in mind that this method may carry a speed penalty, which could be a concern for larger datasets.
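Another option, along the lines of the per-partition idea raised in the question, is to apply pandas' own between_time to each partition. The sketch below is not part of the original answer: between_time_partition is a hypothetical helper, and the small frame stands in for one Dask partition so that the example runs with pandas alone. It assumes the timestamp column is named time, as in the sample data.

```python
import pandas as pd

def between_time_partition(pdf, start="09:30", end="16:00", time_col="time"):
    """Apply pandas' between_time to one partition (a plain pandas DataFrame)."""
    # between_time needs a DatetimeIndex, so index the partition by its timestamp column.
    return pdf.set_index(time_col).between_time(start, end)

# Small pandas frame standing in for a single Dask partition (values from the sample data).
sample = pd.DataFrame({
    "time": pd.to_datetime(["2020-06-18 09:00", "2020-06-18 10:15", "2020-06-18 18:25"]),
    "A": [29, 30, 31],
    "B": [0.330799201, 0.518081116, 0.790506469],
})
result = between_time_partition(sample)
print(result)

# With Dask, the same helper would be applied to every partition, e.g.:
#   import dask.dataframe as dd
#   ddf = dd.read_csv("./*/file.csv", parse_dates=["time"])
#   filtered = ddf.map_partitions(between_time_partition).compute()
```

Note that between_time includes both endpoints by default, whereas the comparisons above use strict inequalities. Because the filter runs independently on each partition, it does not depend on the current date.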
