Pandas between_time equivalent for Dask DataFrame

Question

I have a Dask dataframe created with dd.read_csv("./*/file.csv"), where the * glob matches one folder per date. In the concatenated dataframe I want to keep only a window of each day, the way I would with df.between_time("09:30", "16:00") in Pandas.

Because Dask's internal representation of the index does not have the nice features of Pandas's DatetimeIndex, I haven't had any success filtering the way I normally would in Pandas. Short of resorting to a naive mapping function or loop, I am unable to get this to work in Dask.

Since the partitions are by date, perhaps that could be exploited by converting each partition to a Pandas dataframe and then back to a Dask partition, but it seems like there should be a better way.

Updating with the example used in Angus' answer.

I guess I don't understand the logic of the queries in the answers/comments. Is Pandas smart enough not to interpret the value in the boolean mask literally as a string, and to do the correct datetime comparisons instead?

Answer 1

Filtering in Dask works just like pandas, with a few convenience functions removed.

For example, if you had the following data:

time,A,B
6/18/2020 09:00,29,0.330799201
6/18/2020 10:15,30,0.518081116
6/18/2020 18:25,31,0.790506469

The following code:

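import dask.dataframe as dd

# Parse the 'time' column as datetimes and use it as the index.
df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
# Keep rows strictly between 09:30 and 16:00 (the strings are compared as datetimes).
df.loc[(df.index > "09:30") & (df.index < "16:00")].compute()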

(If run on 18 June 2020) this would return:

time,A,B
2020-06-18 10:15:00,30,0.518081
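
The "if run on 18 June 2020" caveat comes from how pandas parses a time-only string: it has no date, so today's date is filled in before the comparison. A minimal sketch in plain pandas, reusing the three sample timestamps above (the variable names are illustrative only):

```python
import datetime

import pandas as pd

# A time-only string carries no date, so pandas fills in today's date.
cutoff = pd.Timestamp("09:30")
print(cutoff)  # today's date at 09:30:00

# The three sample timestamps from the data above.
idx = pd.to_datetime(
    ["2020-06-18 09:00", "2020-06-18 10:15", "2020-06-18 18:25"]
)

# The strings are parsed to today's 09:30 and 16:00 before comparing,
# so rows from 18 June 2020 match only if the code runs on that day.
mask = (idx > "09:30") & (idx < "16:00")
print(mask)
```

Run on any other day, the mask is all False, which is why the filter appears to work only for the current date.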

EDIT:

The above answer filters for the current date only; pandas interprets the time string as a datetime value with the current date. If you'd like to filter values for all days between specific times, there's a workaround that strips the date from the datetime column:

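import dask.dataframe as dd

df = dd.read_csv('*.csv', parse_dates=['time'])
# Strip the date by round-tripping the time of day through a string.
df["time_of_day"] = dd.to_datetime(df["time"].dt.time.astype(str))
# Keep rows whose time of day falls strictly between 09:30 and 16:00.
df.loc[(df.time_of_day > "09:30") & (df.time_of_day < "16:00")].compute()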

Bear in mind there might be a speed penalty to this method, which could be a concern for larger datasets.

Answer 1 metadata: 3, 2020-06-18 22:50:20