Groupby consecutive values and aggregate

Question

This is my dataset (pandas DataFrame df):

DateTime              INDICATOR
2017-01-01 10:35:00   0
2017-01-01 10:40:00   0
2017-01-01 10:45:00   0
2017-01-01 10:50:00   0
2017-01-01 10:55:00   0
2017-01-01 11:00:00   0
2017-01-01 11:05:00   1
2017-01-01 11:10:00   1
2017-01-01 11:15:00   1
2017-01-01 11:20:00   1
2017-01-01 11:25:00   0
2017-01-01 11:30:00   0
2017-01-01 11:35:00   1
2017-01-01 11:40:00   1
2017-01-01 11:45:00   1

The column DateTime is of type datetime64[ns].

I want to obtain the duration (in minutes) of each data segment where INDICATOR is equal to 1.

The expected result is:

[15, 10]

This is how I tried to solve this task, but I receive all 0 values:

s=df["INDICATOR"].eq(1)
df1=df[s].copy()
s1=df1.groupby(s.cumsum())["DateTime"].transform(lambda x : x.max()-x.min()).dt.seconds

All values of s1 are 0.

Answer 1

First, create a group ID with:

gb_ID = df.INDICATOR.diff().ne(0).cumsum()
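To see why this grouping key helps, here is a minimal sketch that rebuilds the sample data (the construction with date_range is my assumption, not part of the question) and compares it with the s.cumsum() key from the question. Among the rows where INDICATOR is 1, s.cumsum() increases on every row, so each row lands in its own group and max - min is always 0. The diff().ne(0).cumsum() key only increases when the value changes, so each run of equal values shares one ID.

```python
import pandas as pd

# Rebuild the sample data from the question (assumed construction).
df = pd.DataFrame({
    "DateTime": pd.date_range("2017-01-01 10:35", periods=15, freq="5min"),
    "INDICATOR": [0] * 6 + [1] * 4 + [0] * 2 + [1] * 3,
})

# Key from the question: increments on every row where INDICATOR is 1.
s = df["INDICATOR"].eq(1)
print(s.cumsum()[s].tolist())

# Key from this answer: increments only when INDICATOR changes value.
gb_ID = df["INDICATOR"].diff().ne(0).cumsum()
print(gb_ID.tolist())
```

The first print shows a distinct ID per row, the second shows one ID per run of consecutive equal values.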

Next, keep only the rows where INDICATOR == 1 and group them by gb_ID. Take the min and max of DateTime per group, then the difference between them (diff(axis=1) leaves NaT in the min column and the duration in the max column). Finally, select the max column, convert it to whole minutes, and call values to get an array.

df.query('INDICATOR == 1').groupby(gb_ID)['DateTime'].agg(['min', 'max']) \
                          .diff(axis=1)['max'].dt.seconds.floordiv(60).values

Out[351]: array([15, 10], dtype=int64)

Below is the DataFrame before selecting the non-NaT column and calling values:

df.query('INDICATOR == 1').groupby(gb_ID)['DateTime'].agg(['min', 'max']).diff(axis=1)

Out[362]:
          min      max
INDICATOR
2         NaT 00:15:00
4         NaT 00:10:00
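For reference, the same steps can be written as a self-contained sketch. This is a variant under my own assumptions rather than the exact code above: it subtracts the min and max columns directly instead of using diff(axis=1), and it uses dt.total_seconds() instead of dt.seconds, because dt.seconds only returns the seconds component of a timedelta and ignores whole days. The sample data construction is also assumed.

```python
import pandas as pd

# Rebuild the sample data from the question (assumed construction).
df = pd.DataFrame({
    "DateTime": pd.date_range("2017-01-01 10:35", periods=15, freq="5min"),
    "INDICATOR": [0] * 6 + [1] * 4 + [0] * 2 + [1] * 3,
})

# One ID per run of consecutive equal INDICATOR values.
gb_ID = df["INDICATOR"].diff().ne(0).cumsum()

# Keep the rows with INDICATOR == 1 and get the first/last timestamp per run.
ones = df[df["INDICATOR"].eq(1)]
bounds = ones.groupby(gb_ID)["DateTime"].agg(["min", "max"])

# Duration of each run in whole minutes; total_seconds() also counts days.
spans = bounds["max"] - bounds["min"]
minutes = spans.dt.total_seconds().floordiv(60).astype(int).tolist()
print(minutes)  # [15, 10]
```

For the sample data both versions agree; they would only differ for segments lasting a day or longer.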

Answer 2

Taking this post into account, I thought of splitting the DataFrame into sub-frames with np.split().

Try this:

import numpy as np
import pandas as pd

# assumes DateTime is the index of df, e.g. df = df.set_index("DateTime")
# split df just before every row where INDICATOR is 0
splitted_dfs = np.split(df, *np.where(df.INDICATOR == 0))

results = []

for split in splitted_dfs:
    # iloc[1:] omits the leading 0 row of each piece
    results.append(split.iloc[1:].index.max() - split.iloc[1:].index.min())

# pieces with nothing after the leading 0 row yield NaT, so skip those
print([int(x.seconds / 60) for x in results if pd.notna(x)])

# prints [15, 10]

Explanation

np.split() with the positions where INDICATOR == 0 splits the DataFrame immediately before every row where the condition is met, so each piece starts with a 0 row. This yields the following list of DataFrames:

2017-01-01 10:35:00          0

2017-01-01 10:40:00          0

2017-01-01 10:45:00          0

2017-01-01 10:50:00          0

2017-01-01 10:55:00          0

2017-01-01 11:00:00          0
2017-01-01 11:05:00          1
2017-01-01 11:10:00          1
2017-01-01 11:15:00          1
2017-01-01 11:20:00          1

2017-01-01 11:25:00          0

2017-01-01 11:30:00          0
2017-01-01 11:35:00          1
2017-01-01 11:40:00          1
2017-01-01 11:45:00          1

You can iterate over that list, ignore the empty pieces, and remove the leading 0 entry from the relevant ones.
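Passing a DataFrame directly to np.split() depends on how the installed pandas and NumPy versions interact, so it is worth checking in your environment. As a hedged alternative, the sketch below applies the same splitting idea to an array of row positions instead of the DataFrame itself. The sample data construction and the names zero_pos, pieces and durations are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data with DateTime as the index (assumed construction).
df = pd.DataFrame(
    {"INDICATOR": [0] * 6 + [1] * 4 + [0] * 2 + [1] * 3},
    index=pd.date_range("2017-01-01 10:35", periods=15, freq="5min", name="DateTime"),
)

# Positions of the rows where INDICATOR is 0; each one starts a new piece.
zero_pos = np.flatnonzero(df["INDICATOR"].to_numpy() == 0)

# Split the row positions (a plain NumPy array) rather than the DataFrame.
pieces = np.split(np.arange(len(df)), zero_pos)

durations = []
for pos in pieces:
    ones = df.index[pos[1:]]  # drop the leading 0 row of the piece
    if len(ones) > 0:         # skip pieces that held only a 0 row (or nothing)
        durations.append(int((ones.max() - ones.min()).total_seconds() // 60))

print(durations)  # [15, 10]
```

Checking the length of each piece up front avoids producing NaT values that have to be filtered out afterwards.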
