Pandas - group by consecutive ranges
I have a dataframe with the following structure: Start, End and Height.
Some properties of the dataframe:
I'd like to group consecutive rows of the dataframe so that heights fall into buckets of width 5, i.e. the buckets are 0, 1-5, 6-10, 11-15 and >15.
See the code example below; what I'm looking for is the implementation of the group_by_bucket function.
I looked at other questions but couldn't find an exact answer to what I was looking for.
Thanks in advance!
>>> d = pd.DataFrame([[1,3,8], [4,10,7], [11,17,6], [18,26, 12], [27,30, 15], [31,40,6], [41, 42, 7]], columns=['start','end', 'height'])
>>> d
   start  end  height
0      1    3       8
1      4   10       7
2     11   17       6
3     18   26      12
4     27   30      15
5     31   40       6
6     41   42       7
>>> d_gb = group_by_bucket(d)
>>> d_gb
   start  end height_grouped
0      1   17           6_10
1     18   30          11_15
2     31   42           6_10
One way to do that:
df = pd.DataFrame([[1,3,10], [4,10,7], [11,17,6], [18,26, 12],
[27,30, 15], [31,40,6], [41, 42, 6]], columns=['start','end', 'height'])
Use cut to make groups:
df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])
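As a small illustration of how the bin edges map onto the buckets from the question, the sketch below runs pd.cut on the boundary values. The label names and the use of infinity as the last edge are assumptions made for this example, not part of the answer above.

```python
import pandas as pd

# Boundary values of every bucket named in the question
heights = pd.Series([0, 1, 5, 6, 10, 11, 15, 16], name="height")

# pd.cut intervals are right-closed by default:
# (-1, 0], (0, 5], (5, 10], (10, 15], (15, inf]
edges = [-1, 0, 5, 10, 15, float("inf")]
labels = ["0", "1_5", "6_10", "11_15", ">15"]  # hypothetical label names

buckets = pd.cut(heights, bins=edges, labels=labels)
print(buckets.tolist())
# ['0', '1_5', '1_5', '6_10', '6_10', '11_15', '11_15', '>15']
```

Starting the edges at -1 is what lets a height of exactly 0 land in its own bucket.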
Find break points:
df['categories']=(df.groups!=df.groups.shift()).cumsum()
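The shift/cumsum idiom can be seen in isolation on a plain Series of bucket labels. This is only a sketch with made-up string labels; note that here the first row always counts as a change, so the ids start at 1, and the starting value of the ids may differ between pandas versions when the comparison is done on categorical data. Only the points where the id changes matter for grouping.

```python
import pandas as pd

# Hypothetical bucket labels, one per row
labels = pd.Series(["6_10", "6_10", "6_10", "11_15", "11_15", "6_10", "6_10"])

# True wherever a row differs from the row before it
starts_new_run = labels.ne(labels.shift())

# The running total of those flags numbers each consecutive run
run_id = starts_new_run.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 3]
```

The two separate "6_10" runs get different ids, which is what keeps them from being merged by a plain groupby on the label.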
Then df is:
"""
   start  end  height    groups  categories
0      1    3      10   (5, 10]           0
1      4   10       7   (5, 10]           0
2     11   17       6   (5, 10]           0
3     18   26      12  (10, 15]           1
4     27   30      15  (10, 15]           1
5     31   40       6   (5, 10]           2
6     41   42       6   (5, 10]           2
"""
Define the aggregations of interest:
f = {'start':['first'],'end':['last'], 'groups':['first']}
And use the groupby.agg function:
df.groupby('categories').agg(f)
"""
              groups  end start
               first last first
categories
0            (5, 10]   17     1
1           (10, 15]   30    18
2            (5, 10]   42    31
"""
You can use cut to bin the heights, then groupby on both the cut Series and a cumsum Series that numbers the consecutive groups, and aggregate with agg, first and last:
bins = [-1,0,1,5,10,15,100]
print(bins)
[-1, 0, 1, 5, 10, 15, 100]
cut_ser = pd.cut(d['height'], bins=bins)
print(cut_ser)
0 (5, 10]
1 (5, 10]
2 (5, 10]
3 (10, 15]
4 (10, 15]
5 (5, 10]
6 (5, 10]
Name: height, dtype: category
Categories (6, object): [(-1, 0] < (0, 1] < (1, 5] < (5, 10] < (10, 15] < (15, 100]]
print((cut_ser.shift() != cut_ser).cumsum())
0 0
1 0
2 0
3 1
4 1
5 2
6 2
Name: height, dtype: int32
print(d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser])
       .agg({'start' : 'first','end' : 'last'})
       .reset_index(level=1).reset_index(drop=True)
       .rename(columns={'height':'height_grouped'}))
  height_grouped  start  end
0        (5, 10]      1   17
1       (10, 15]     18   30
2        (5, 10]     31   42
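Putting the steps together, here is a minimal sketch of the group_by_bucket function the question asks for. It is one possible arrangement, not the code from either answer. The assumptions: the label strings are chosen to match the question's expected output, the first row uses height 8 as in the printed frame, and named aggregation (available in pandas 0.25 and later) is used so the result has flat columns. It groups on the run id only, because grouping directly on a categorical key can, depending on the pandas version and the observed setting, emit rows for categories that never occur.

```python
import numpy as np
import pandas as pd


def group_by_bucket(d):
    """Merge consecutive rows whose height falls in the same bucket."""
    edges = [-1, 0, 5, 10, 15, np.inf]
    labels = ["0", "1_5", "6_10", "11_15", ">15"]  # assumed label names
    # Plain strings keep the comparison and the groupby simple
    bucket = pd.cut(d["height"], bins=edges, labels=labels).astype(str)
    # A new run starts wherever the bucket differs from the previous row
    run_id = bucket.ne(bucket.shift()).cumsum().rename("run")
    return (
        d.assign(height_grouped=bucket)
        .groupby(run_id)
        .agg(
            start=("start", "first"),
            end=("end", "last"),
            height_grouped=("height_grouped", "first"),
        )
        .reset_index(drop=True)
    )


d = pd.DataFrame(
    [[1, 3, 8], [4, 10, 7], [11, 17, 6], [18, 26, 12],
     [27, 30, 15], [31, 40, 6], [41, 42, 7]],
    columns=["start", "end", "height"],
)
d_gb = group_by_bucket(d)
print(d_gb)
```

With this frame the result has three rows, covering 1 to 17, 18 to 30 and 31 to 42.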
EDIT:
Timings:
In [307]: %timeit a(df)
100 loops, best of 3: 5.45 ms per loop
In [308]: %timeit b(d)
The slowest run took 4.45 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.28 ms per loop
Code:
d = pd.DataFrame([[1,3,8], [4,10,7], [11,17,6], [18,26, 12], [27,30, 15], [31,40,6], [41, 42, 7]], columns=['start','end', 'height'])
print(d)
df = d.copy()
# First approach: helper columns, then group on the run id
def a(df):
    df['groups']=pd.cut(df.height,[-1,0,5,10,15,1000])
    df['categories']=(df.groups!=df.groups.shift()).cumsum()
    f = {'start':['first'],'end':['last'], 'groups':['first']}
    return df.groupby('categories').agg(f)
# Second approach: group on the run id and the cut Series directly
def b(d):
    bins = [-1,0,1,5,10,15,100]
    cut_ser = pd.cut(d['height'], bins=bins)
    return d.groupby([(cut_ser.shift() != cut_ser).cumsum(), cut_ser]).agg({'start' : 'first','end' : 'last'}).reset_index(level=1).reset_index(drop=True).rename(columns={'height':'height_grouped'})
print(a(df))
print(b(d))
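The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The run_ids helper below is a hypothetical stand-in that times only the cut and cumsum steps, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import pandas as pd

d = pd.DataFrame(
    [[1, 3, 8], [4, 10, 7], [11, 17, 6], [18, 26, 12],
     [27, 30, 15], [31, 40, 6], [41, 42, 7]],
    columns=["start", "end", "height"],
)


def run_ids(frame):
    # Integer bucket codes, then a running count of bucket changes
    codes = pd.cut(frame["height"], [-1, 0, 5, 10, 15, 1000]).cat.codes
    return codes.ne(codes.shift()).cumsum()


# Best of 3 repeats, 100 calls each, reported per call
best = min(timeit.repeat(lambda: run_ids(d), number=100, repeat=3)) / 100
print(f"{best * 1e3:.3f} ms per loop")
```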