Pandas groupby then cut into intervals of the min/max of the group

I have this dataframe:

import pandas as pd

df = pd.DataFrame({'time' : [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value' : [0.10, 0.25, 0.40, 0.24, 0.20, 0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A' : [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B' : [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

that looks like this:

   time  value  quantity_A  quantity_B
0     1   0.10           1           2
1     1   0.25           2           2
2     1   0.40           3           3
3     1   0.24           1           4
4     1   0.20           2           2
5     2   0.36           1           2
6     2   0.31           1           3
7     2   0.20           2           4
8     2   0.32           1           1
9     2   0.40           1           1

I want to have something like this:

   time      interval  quantity_A  quantity_B
0     1    [0.1, 0.2]           3           4
1     1    (0.2, 0.3]           3           6
2     1    (0.3, 0.4]           3           3
3     2    [0.2, 0.3]           2           4
4     2    (0.3, 0.4]           4           7

or this, which I would prefer, but it seems harder to do because it doesn't work with cut:

   time      interval  quantity_A  quantity_B
0     1           0.1           1           2
1     1           0.2           0           0
2     1           0.3           5           8
3     1           0.4           3           3
4     2           0.2           2           4
5     2           0.3           3           6
6     2           0.4           1           1

Here the dataframe is grouped by time, and the interval depends on the min and max of each group, with a step size that can be specified (0.1 in this case). quantity_A and quantity_B should be summed according to the group and interval they fall in. I have managed to do this manually by iterating over the whole dataframe, but since my dataset is huge it takes a long time. Is there a way to do this with pandas functions like groupby and cut to speed it up?

Edit: min and max should be the minimum and maximum of value within each group. In this case the group with time == 1 has min = 0.1 and max = 0.4, and the group with time == 2 has min = 0.2 and max = 0.4. If there were a value like 0.54 in group 2, it would be the max.

Not sure if there is a built-in method available, but if you are flexible on the intervals, so that they are [0.2, 0.3) instead of (0.2, 0.3], then the following would work:

# one way to truncate the second decimal place
df['value'] = (df['value'] * 10).astype(int) / 10

# rename the column
df.rename(columns={'value': 'interval'}, inplace=True)

# groupby which works same as interval [x ,y) instead of (x, y]
df = df.groupby(['time', 'interval']).sum().reset_index()

Output:

   time  interval  quantity_A  quantity_B
0     1       0.1           1           2
1     1       0.2           5           8
2     1       0.4           3           3
3     2       0.2           2           4
4     2       0.3           3           6
5     2       0.4           1           1
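
The truncation above hard-codes a step of 0.1. A minimal sketch of the same idea for an arbitrary step is shown below; the names step, left_edge and result are illustrative and not part of the original answer. It assumes that [x, x + step) bins anchored at multiples of step (rather than at each group's minimum) are acceptable, and it rounds before flooring because float division can land just under an integer (for example, 0.3 / 0.1 evaluates to 2.9999999999999996).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20, 0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1  # assumed bin width; change as needed

# Left edge of the [x, x + step) bin each value falls into.
# The inner round(9) absorbs float noise before flooring;
# the outer round(9) tidies the edge labels.
left_edge = (np.floor((df['value'] / step).round(9)) * step).round(9)

result = (df
          .assign(interval=left_edge)
          .drop(columns='value')
          .groupby(['time', 'interval'], as_index=False)
          .sum())
print(result)
```

On the sample data this should give the same sums as the output above. Intervals that contain no rows are simply absent from the result.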

Using pandas.cut per group:

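import numpy as np
import pandas as pd

step = 0.1

(df
   .groupby('time', group_keys=False)
   .apply(lambda g:
          # bin this group's values between the group's own min and max;
          # the stop is nudged just past the max so the max becomes the last bin edge
          g.assign(interval=pd.cut(g['value'],
                                   bins=np.arange(g['value'].min(),
                                                  g['value'].max()*1.01,
                                                  step),
                                   include_lowest=True)
                  )
         )
   .drop(columns='value')
   .groupby(['time', 'interval'])
   .sum().reset_index()
)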

Output:

   time      interval  quantity_A  quantity_B
0     1  (0.099, 0.2]           3           4
1     1    (0.2, 0.3]           3           6
2     1    (0.3, 0.4]           3           3
3     2  (0.199, 0.3]           2           4
4     2    (0.3, 0.4]           4           7
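
Since the question is about speed on a large dataset, here is a hedged sketch of a vectorized variant that avoids groupby.apply. It is not taken from the answer above: it computes each row's bin index arithmetically from the group minimum, and the names group_min, offset, bin_idx, left and right are made up for illustration. It assumes right-closed bins starting at each group's minimum, with the minimum itself placed in the first bin (mirroring include_lowest=True), and it rounds to 9 decimals to absorb float noise.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20, 0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1

# Minimum of 'value' within each time group, broadcast back to every row.
group_min = df.groupby('time')['value'].transform('min')

# Distance from the group minimum in units of `step` (rounded to hide float noise).
offset = ((df['value'] - group_min) / step).round(9)

# Right-closed bins (min + k*step, min + (k+1)*step];
# clip puts the group minimum itself into bin 0.
bin_idx = np.ceil(offset).sub(1).clip(lower=0).astype(int)

result = (df
          .assign(left=(group_min + bin_idx * step).round(9),
                  right=(group_min + (bin_idx + 1) * step).round(9))
          .groupby(['time', 'left', 'right'], as_index=False)[['quantity_A', 'quantity_B']]
          .sum())
print(result)
```

On the sample data the sums should match the pandas.cut output above, with the interval expressed as separate left and right columns instead of Interval objects. Whether this is actually faster on a given dataset would need to be measured.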
