I have this dataframe:
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20, 0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})
that looks like this:
time value quantity_A quantity_B
0 1 0.10 1 2
1 1 0.25 2 2
2 1 0.40 3 3
3 1 0.24 1 4
4 1 0.20 2 2
5 2 0.36 1 2
6 2 0.31 1 3
7 2 0.20 2 4
8 2 0.32 1 1
9 2 0.40 1 1
I want to have something like this:
time interval quantity_A quantity_B
0 1 [0.1, 0.2] 3 4
1 1 (0.2, 0.3] 3 6
2 1 (0.3, 0.4] 3 3
3 2 [0.2, 0.3] 2 4
4 2 (0.3, 0.4] 4 7
Or this would be preferred, but it seems harder to do because it doesn't work with cut:
time interval quantity_A quantity_B
0 1 0.1 1 2
1 1 0.2 0 0
2 1 0.3 5 8
3 1 0.4 3 3
4 2 0.2 2 4
5 2 0.3 3 6
6 2 0.4 1 1
Here the dataframe is grouped by time, and the interval depends on the min and max of each group, with a step size that can be specified (in this case 0.1). quantity_A and quantity_B should be summed up depending on which group and interval they fall in. I have managed to do this manually by iterating over the whole dataframe, but since my dataset is huge it takes a long time. Is there a way to do this with pandas functions like groupby and cut to speed it up?
Edit: min and max should be the minimum and maximum of value within each group. Here, the group with time == 1 has min = 0.1 and max = 0.4, and the group with time == 2 has min = 0.2 and max = 0.4. If group 2 contained a value like 0.54, that would be its max.
Not sure if there is a built-in method available, but if you are flexible on the intervals, so that it is [0.2, 0.3) instead of (0.2, 0.3], then the following would work:
# one way to truncate the second decimal place
df['value'] = (df['value'] * 10).astype(int) / 10
# rename the column
df.rename(columns={'value': 'interval'}, inplace=True)
# group and sum; each bin behaves as [x, y) instead of (x, y]
df = df.groupby(['time', 'interval']).sum().reset_index()
Output:
time interval quantity_A quantity_B
0 1 0.1 1 2
1 1 0.2 5 8
2 1 0.4 3 3
3 2 0.2 2 4
4 2 0.3 3 6
5 2 0.4 1 1
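If the zero rows from your preferred second output matter, this truncation approach can be extended by reindexing each group over its full range of bins so that empty bins appear with zero quantities. A sketch along those lines (the fill_empty_bins helper is something I'm introducing here, and the *10 truncation assumes step = 0.1 and non-negative values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20,
                             0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1
# truncate each value to the left edge of its [x, x + step) bin, as above
df['interval'] = (df['value'] * 10).astype(int) / 10

def fill_empty_bins(g):
    # reindex over every step between the group's min and max interval,
    # so bins with no rows show up as rows of zeros
    full = np.round(np.arange(g.index.min(), g.index.max() + step / 2, step), 10)
    return g.reindex(full, fill_value=0).rename_axis('interval')

out = (df.drop(columns='value')
         .groupby(['time', 'interval']).sum()
         .groupby(level='time', group_keys=True)
         .apply(lambda g: fill_empty_bins(g.droplevel('time')))
         .reset_index())
```

For the sample data this adds a (time=1, interval=0.3) row with zero quantities that the plain groupby-sum omits.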
Using pandas.cut per group:
import numpy as np

step = 0.1
(df
 .groupby('time', group_keys=False)
 .apply(lambda g:
        g.assign(interval=pd.cut(g['value'],  # bin the group's own values, not the whole frame
                                 bins=np.arange(g['value'].min(),
                                                g['value'].max() * 1.01,
                                                step),
                                 include_lowest=True))
        )
 .drop(columns='value')
 .groupby(['time', 'interval'])
 .sum().reset_index()
)
Output:
time interval quantity_A quantity_B
0 1 (0.099, 0.2] 3 4
1 1 (0.2, 0.3] 3 6
2 1 (0.3, 0.4] 3 3
3 2 (0.199, 0.3] 2 4
4 2 (0.3, 0.4] 4 7
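The odd-looking edges such as (0.099, 0.2] are just display artifacts of include_lowest padding the first bin. If labels closer to the preferred second output are wanted, one option (a sketch, not the only way) is to replace each resulting Interval with its rounded right edge after aggregating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [0.10, 0.25, 0.40, 0.24, 0.20,
                             0.36, 0.31, 0.20, 0.32, 0.40],
                   'quantity_A': [1, 2, 3, 1, 2, 1, 1, 2, 1, 1],
                   'quantity_B': [2, 2, 3, 4, 2, 2, 3, 4, 1, 1]})

step = 0.1
res = (df
       .groupby('time', group_keys=False)
       .apply(lambda g: g.assign(interval=pd.cut(g['value'],
                                                 bins=np.arange(g['value'].min(),
                                                                g['value'].max() * 1.01,
                                                                step),
                                                 include_lowest=True)))
       .drop(columns='value')
       .groupby(['time', 'interval'])
       .sum().reset_index())

# label each bin by its right edge, rounded to hide float noise
res['interval'] = res['interval'].map(lambda iv: round(iv.right, 10))
```

This keeps one row per non-empty bin and turns (0.099, 0.2] into the scalar label 0.2.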