Pandas groupby then cut into intervals of the min/max of the group
I have this dataframe:
df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20, 0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})
that looks like this:
time value quantity_A quantity_B
0 1 0.10 1 2
1 1 0.25 2 2
2 1 0.40 3 3
3 1 0.24 1 4
4 1 0.20 2 2
5 2 0.36 1 2
6 2 0.31 1 3
7 2 0.20 2 4
8 2 0.32 1 1
9 2 0.40 1 1
I want to have something like this:
time interval quantity_A quantity_B
0 1 [0.1, 0.2] 3 4
1 1 (0.2, 0.3] 3 6
2 1 (0.3, 0.4] 3 3
3 2 [0.2, 0.3] 2 4
4 2 (0.3, 0.4] 4 7
Or this would be preferred, but it seems harder to do because it doesn't work with cut:
time interval quantity_A quantity_B
0 1 0.1 1 2
1 1 0.2 0 0
2 1 0.3 5 8
3 1 0.4 3 3
4 2 0.2 2 4
5 2 0.3 3 6
6 2 0.4 1 1
Here the dataframe is grouped by time, and the interval depends on the min and max of each group, with a step size that can be specified (in this case, 0.1). quantity_A and quantity_B should be summed up depending on which group and interval they fall in. I have managed to do this manually by iterating over the whole dataframe, but since my dataset is huge it takes a long time. Is there a way to do this with pandas functions like groupby and cut to speed it up?
Edit: min and max should be the minimum and maximum of value within each group. In this case the group with time == 1 has min = 0.1 and max = 0.4, and the group with time == 2 has min = 0.2 and max = 0.4. If there were a value like 0.54 in group 2, it would be that group's max.
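The per-group extents described in the edit can be checked directly; a minimal sketch using only the time and value columns:

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20,
                             0.36, 0.31, 0.20, 0.32, 0.40]})

# per-group min/max of 'value' -- these drive the bin edges below
extents = df.groupby('time')['value'].agg(['min', 'max'])
print(extents)
#       min  max
# time
# 1     0.1  0.4
# 2     0.2  0.4
```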
Not sure if there is a built-in method available, but if you are flexible on the intervals, so that they are [0.2, 0.3) instead of (0.2, 0.3], then the following works:
# one way to truncate to one decimal place (floor, not round: 0.25 -> 0.2)
df['value'] = (df['value'] * 10).astype(int) / 10
# rename the column
df.rename(columns={'value': 'interval'}, inplace=True)
# each truncated value x stands for the interval [x, x + 0.1)
df = df.groupby(['time', 'interval']).sum().reset_index()
Output:
time interval quantity_A quantity_B
0 1 0.1 1 2
1 1 0.2 5 8
2 1 0.4 3 3
3 2 0.2 2 4
4 2 0.3 3 6
5 2 0.4 1 1
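The truncation above hard-codes a 0.1 step. One possible generalization to an arbitrary step, assuming the same lower-edge [x, x + step) labeling, uses np.floor instead of integer truncation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20,
                             0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1  # any positive step size

# floor each value to the lower edge of its [x, x + step) bin;
# the round() suppresses float noise such as 0.30000000000000004
df['interval'] = (np.floor(df['value'] / step) * step).round(10)
out = (df.drop(columns='value')
         .groupby(['time', 'interval'])
         .sum()
         .reset_index())
print(out)
```

With step = 0.1 this reproduces the table above; changing step rebins without touching the rest of the pipeline.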
Using pandas.cut per group:
import numpy as np

step = 0.1
(df
 .groupby('time', group_keys=False)
 .apply(lambda g:
        g.assign(interval=pd.cut(g['value'],  # bin within the group, not the whole frame
                                 bins=np.arange(g['value'].min(),
                                                g['value'].max() * 1.01,  # stretch so the max lands in the last bin
                                                step),
                                 include_lowest=True)))
 .drop(columns='value')
 .groupby(['time', 'interval'])
 .sum()
 .reset_index()
)
Output:
time interval quantity_A quantity_B
0 1 (0.099, 0.2] 3 4
1 1 (0.2, 0.3] 3 6
2 1 (0.3, 0.4] 3 3
3 2 (0.199, 0.3] 2 4
4 2 (0.3, 0.4] 4 7
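If you want scalar edges in the output (as in the question's preferred table) rather than Interval objects, pd.cut also takes a labels= argument. A sketch that labels each (x, x + step] bin by its right edge, per group (bin_group is a hypothetical helper; observed=False keeps empty bins as zero rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20,
                             0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1

def bin_group(g):
    # edges from the group's own min/max; rounding keeps the last edge
    # equal to the true max instead of a float slightly off from it
    lo, hi = g['value'].min(), g['value'].max()
    edges = np.round(np.arange(lo, hi + step / 2, step), 10)
    # label each bin by its right edge instead of an Interval
    labels = pd.cut(g['value'], bins=edges, labels=edges[1:],
                    include_lowest=True)
    # observed=False keeps empty bins in the result as zero rows
    return (g[['quantity_A', 'quantity_B']]
            .groupby(labels, observed=False).sum())

out = (df.groupby('time')
         .apply(bin_group)
         .reset_index()
         .rename(columns={'value': 'interval'}))
print(out)
```

The counts are the same as in the table above; only the interval column changes from pairs of edges to single right-edge values.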