Pandas groupby with overlapping groups / windows
I suspect this use just isn't compatible with groupby, so maybe I'm actually asking for a different pattern that matches what I want. I have a dataframe of events with timespans and want to be able to iterate over/apply functions to the rows for each day. But if a row starts on one day and ends on another, I want that row to be included in both days.
import pandas as pd

# note: the DatetimeIndex(start=..., end=..., freq=...) constructor has been
# removed from pandas; pd.date_range is the equivalent
start = pd.date_range(start='2018-02-01 21:00:00',
                      end='2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1},
                  columns=['start', 'end', 'value'])
start end value
0 2018-02-01 2018-02-02 1
1 2018-02-02 2018-02-02 1
2 2018-02-02 2018-02-02 1
3 2018-02-02 2018-02-02 1
4 2018-02-02 2018-02-03 1
5 2018-02-03 2018-02-03 1
6 2018-02-03 2018-02-03 1
7 2018-02-03 2018-02-03 1
8 2018-02-03 2018-02-04 1
9 2018-02-04 2018-02-04 1
10 2018-02-04 2018-02-04 1
11 2018-02-04 2018-02-04 1
12 2018-02-04 2018-02-05 1
13 2018-02-05 2018-02-05 1
14 2018-02-05 2018-02-05 1
15 2018-02-05 2018-02-05 1
16 2018-02-05 2018-02-06 1
So the first group should contain [0, ..., 4], then [4, ..., 8], etc. In practice the events aren't evenly spaced, so the lengths (in rows) of each day won't be constant.
The closest I've managed is starting with groupby.indices and manipulating the groups to match what I want, but this feels pretty gross.
{k: np.append(v[0] - 1, v) for k, v in df.groupby('start').indices.items()
if not (len(v) == 1 and v[0] == 0)}
{Timestamp('2018-02-02 00:00:00'): array([0, 1, 2, 3, 4]),
Timestamp('2018-02-03 00:00:00'): array([4, 5, 6, 7, 8]),
Timestamp('2018-02-04 00:00:00'): array([ 8, 9, 10, 11, 12]),
Timestamp('2018-02-05 00:00:00'): array([12, 13, 14, 15, 16])}
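For the iteration itself, that mapping is enough to drive a per-day apply. A minimal self-contained sketch (the frame is rebuilt with pd.date_range in place of the removed DatetimeIndex constructor, and a per-day sum of value stands in for whatever function you actually want to apply):

```python
import numpy as np
import pandas as pd

# rebuild the example frame
start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

# day -> positional row indices, with the previous day's last row prepended
# so that a row spanning midnight lands in both days
day_to_rows = {k: np.append(v[0] - 1, v)
               for k, v in df.groupby('start').indices.items()
               if not (len(v) == 1 and v[0] == 0)}

# apply an arbitrary function to each (overlapping) daily slice
daily_sums = {day: df.iloc[rows]['value'].sum()
              for day, rows in day_to_rows.items()}
```

Note that `groupby.indices` yields positional indices, hence the `iloc` lookup.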
I believe you want to aggregate. There are many ways to go, for example
def e(inp):
    return [inp.index]
>>> df.groupby('end').aggregate(e)['start']
end
2018-02-02 [[0, 1, 2, 3]]
2018-02-03 [[4, 5, 6, 7]]
2018-02-04 [[8, 9, 10, 11]]
2018-02-05 [[12, 13, 14, 15]]
2018-02-06 [[16]]
Name: start, dtype: object
and
df.groupby('start').aggregate(e)['end']
start
2018-02-01 [[0]]
2018-02-02 [[1, 2, 3, 4]]
2018-02-03 [[5, 6, 7, 8]]
2018-02-04 [[9, 10, 11, 12]]
2018-02-05 [[13, 14, 15, 16]]
Name: end, dtype: object
Now you can play with these series; e.g., the following yields your output:
merged = (df.groupby('end').aggregate(e)['start'] + df.groupby('start').aggregate(e)['end']).dropna()
merged.apply(lambda k: k[0].union(k[1]))
2018-02-02 Int64Index([0, 1, 2, 3, 4], dtype='int64')
2018-02-03 Int64Index([4, 5, 6, 7, 8], dtype='int64')
2018-02-04 Int64Index([8, 9, 10, 11, 12], dtype='int64')
2018-02-05 Int64Index([12, 13, 14, 15, 16], dtype='int64')
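Each entry of that result is a pandas Index of row labels, so it can drive any per-day computation. A self-contained sketch (frame rebuilt with pd.date_range; the daily sum of value is a hypothetical stand-in for your real function):

```python
import pandas as pd

start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

def e(inp):
    # wrap the group's row labels in a one-element list so the two
    # groupby results can be added elementwise below
    return [inp.index]

# days present in both groupbys survive the dropna; each surviving
# entry is a two-element list [end-group labels, start-group labels]
merged = (df.groupby('end').aggregate(e)['start']
          + df.groupby('start').aggregate(e)['end']).dropna()
day_index = merged.apply(lambda k: k[0].union(k[1]))

# apply a function over each overlapping daily slice
daily_sums = day_index.apply(lambda idx: df.loc[idx, 'value'].sum())
```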
First, I would concatenate the start and end data and name the result column date, such as:
df_concat = pd.DataFrame(pd.concat([df.start,df.end]),columns=['date'])
Then I would create a column holding the index:
df_concat['index'] = df_concat.apply(lambda x: x.name,axis=1)
And finally a groupby and apply, such as:
df_concat.groupby('date')['index'].apply(lambda x: sorted(set(x)))
The output is like:
date
2018-02-01 [0]
2018-02-02 [0, 1, 2, 3, 4]
2018-02-03 [4, 5, 6, 7, 8]
2018-02-04 [8, 9, 10, 11, 12]
2018-02-05 [12, 13, 14, 15, 16]
2018-02-06 [16]
Name: index, dtype: object
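That series of index lists can likewise drive per-day processing end to end. A self-contained sketch (frame rebuilt with pd.date_range; the daily sum of value is a hypothetical stand-in for the real function):

```python
import pandas as pd

start = pd.date_range('2018-02-01 21:00:00', '2018-02-05 21:00:00', freq='6h')
df = pd.DataFrame({'start': start.date, 'end': start.shift(1).date, 'value': 1})

# stack the start and end dates into one column; each original row label
# then appears once for every date the row touches
df_concat = pd.DataFrame(pd.concat([df.start, df.end]), columns=['date'])
df_concat['index'] = df_concat.apply(lambda x: x.name, axis=1)

# per date, collect the (deduplicated, sorted) row labels touching it
day_rows = df_concat.groupby('date')['index'].apply(lambda x: sorted(set(x)))

# apply a function to each daily slice; boundary rows count in both days
daily_sums = day_rows.apply(lambda rows: df.loc[rows, 'value'].sum())
```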
As @RafaelC said, there are many ways; this one uses apply rather than aggregate, and I don't remove the dates whose corresponding list contains only one value.