Pandas groupby with overlapping groups / windows
I suspect this use just isn't compatible with groupby, so maybe I'm actually asking for a different pattern that matches what I want. I have a dataframe of events with timespans and want to be able to iterate over/apply functions to the rows for each day. But if a row starts on one day and ends on another, I want that row to be included in both days.
import pandas as pd

# note: the DatetimeIndex(start=..., end=..., freq=...) constructor has been
# removed from pandas; pd.date_range is the equivalent
start = pd.date_range(start='2018-02-01 21:00:00',
                      end='2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1},
                  columns=['start', 'end', 'value'])
start end value
0 2018-02-01 2018-02-02 1
1 2018-02-02 2018-02-02 1
2 2018-02-02 2018-02-02 1
3 2018-02-02 2018-02-02 1
4 2018-02-02 2018-02-03 1
5 2018-02-03 2018-02-03 1
6 2018-02-03 2018-02-03 1
7 2018-02-03 2018-02-03 1
8 2018-02-03 2018-02-04 1
9 2018-02-04 2018-02-04 1
10 2018-02-04 2018-02-04 1
11 2018-02-04 2018-02-04 1
12 2018-02-04 2018-02-05 1
13 2018-02-05 2018-02-05 1
14 2018-02-05 2018-02-05 1
15 2018-02-05 2018-02-05 1
16 2018-02-05 2018-02-06 1
So the first group should contain [0, ..., 4], then [4, ..., 8], etc. In practice the events aren't evenly spaced, so the lengths (in rows) of each day won't be constant.
The closest I've managed is starting with groupby.indices and manipulating the groups to match what I want, but this feels pretty gross.
{k: np.append(v[0] - 1, v) for k, v in df.groupby('start').indices.items()
if not (len(v) == 1 and v[0] == 0)}
{Timestamp('2018-02-02 00:00:00'): array([0, 1, 2, 3, 4]),
Timestamp('2018-02-03 00:00:00'): array([4, 5, 6, 7, 8]),
Timestamp('2018-02-04 00:00:00'): array([ 8, 9, 10, 11, 12]),
Timestamp('2018-02-05 00:00:00'): array([12, 13, 14, 15, 16])}
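For the iteration itself, that mapping is enough to drive a per-day apply. A minimal self-contained sketch (the frame is rebuilt with pd.date_range in place of the removed DatetimeIndex constructor, and a per-day sum of value stands in for whatever function you actually want to apply):

```python
import numpy as np
import pandas as pd

# rebuild the example frame
start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

# day -> positional row indices, with the previous day's last row prepended
# so that a row spanning midnight lands in both days
day_to_rows = {k: np.append(v[0] - 1, v)
               for k, v in df.groupby('start').indices.items()
               if not (len(v) == 1 and v[0] == 0)}

# apply an arbitrary function to each (overlapping) daily slice
daily_sums = {day: df.iloc[rows]['value'].sum()
              for day, rows in day_to_rows.items()}
```

Note that `groupby.indices` yields positional indices, hence the `iloc` lookup.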
I believe you want to aggregate. There are many ways to go, for example
def e(inp):
    return [inp.index]
>>> df.groupby('end').aggregate(e)['start']
end
2018-02-02 [[0, 1, 2, 3]]
2018-02-03 [[4, 5, 6, 7]]
2018-02-04 [[8, 9, 10, 11]]
2018-02-05 [[12, 13, 14, 15]]
2018-02-06 [[16]]
Name: start, dtype: object
and
df.groupby('start').aggregate(e)['end']
start
2018-02-01 [[0]]
2018-02-02 [[1, 2, 3, 4]]
2018-02-03 [[5, 6, 7, 8]]
2018-02-04 [[9, 10, 11, 12]]
2018-02-05 [[13, 14, 15, 16]]
Name: end, dtype: object
Now you can play with these series; e.g., the following yields your output:
merged = (df.groupby('end').aggregate(e)['start'] + df.groupby('start').aggregate(e)['end']).dropna()
merged.apply(lambda k: k[0].union(k[1]))
2018-02-02 Int64Index([0, 1, 2, 3, 4], dtype='int64')
2018-02-03 Int64Index([4, 5, 6, 7, 8], dtype='int64')
2018-02-04 Int64Index([8, 9, 10, 11, 12], dtype='int64')
2018-02-05 Int64Index([12, 13, 14, 15, 16], dtype='int64')
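Each entry of that result is a pandas Index of row labels, so it can drive any per-day computation. A self-contained sketch (frame rebuilt with pd.date_range; the daily sum of value is a hypothetical stand-in for your real function):

```python
import pandas as pd

start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

def e(inp):
    # wrap the group's row labels in a one-element list so the two
    # groupby results can be added elementwise below
    return [inp.index]

# days present in both groupbys survive the dropna; each surviving
# entry is a two-element list [end-group labels, start-group labels]
merged = (df.groupby('end').aggregate(e)['start']
          + df.groupby('start').aggregate(e)['end']).dropna()
day_index = merged.apply(lambda k: k[0].union(k[1]))

# apply a function over each overlapping daily slice
daily_sums = day_index.apply(lambda idx: df.loc[idx, 'value'].sum())
```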
First, I would concatenate the start and end data and name the result column date, such as:
df_concat = pd.DataFrame(pd.concat([df.start,df.end]),columns=['date'])
Then I would create a column holding the index:
df_concat['index'] = df_concat.apply(lambda x: x.name,axis=1)
And finally a groupby and apply, such as:
df_concat.groupby('date')['index'].apply(lambda x: sorted(set(x)))
The output is like:
date
2018-02-01 [0]
2018-02-02 [0, 1, 2, 3, 4]
2018-02-03 [4, 5, 6, 7, 8]
2018-02-04 [8, 9, 10, 11, 12]
2018-02-05 [12, 13, 14, 15, 16]
2018-02-06 [16]
Name: index, dtype: object
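That series of index lists can likewise drive per-day processing end to end. A self-contained sketch (frame rebuilt with pd.date_range; the daily sum of value is a hypothetical stand-in for the real function):

```python
import pandas as pd

start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

# stack the start and end dates into one column; each original row label
# then appears once for every date the row touches
df_concat = pd.DataFrame(pd.concat([df.start, df.end]), columns=['date'])
df_concat['index'] = df_concat.apply(lambda x: x.name, axis=1)

# per date, collect the (deduplicated, sorted) row labels touching it
day_rows = df_concat.groupby('date')['index'].apply(lambda x: sorted(set(x)))

# apply a function to each daily slice; boundary rows count in both days
daily_sums = day_rows.apply(lambda rows: df.loc[rows, 'value'].sum())
```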
As @RafaelC said, there are many ways; this one uses apply rather than aggregate, and I don't remove the dates whose corresponding list contains only one value.