Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?
Example DF:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
'v': np.ones(15)}
)
Note that there are only two unique IDs, so a simple groupby('id')
won't work. Also, the IDs don't alternate/repeat in a regular manner. What I came up with was to recreate the index so that it represents the blocks of consecutive IDs:
# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])
# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)
# set as index:
df = df.set_index(np.array(idx))
# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0 5.0
# 1 3.0
# 2 2.0
# 3 1.0
# 4 1.0
# 5 3.0
This re-creation of the index doesn't feel like the idiomatic way to do this in pandas. What did I miss? Is there a better way to do this?
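(For reference, the index-building loop above is equivalent to a cumulative sum over the change mask; a vectorized sketch of the same idea:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# True wherever the id differs from the previous row (first row included):
m = np.concatenate(([True], df['id'].values[:-1] != df['id'].values[1:]))

# Running count of changes labels each block; it equals the loop-built
# idx shifted by one, which makes no difference to groupby.
blocks = np.cumsum(m)
print(df.groupby(blocks)['v'].sum())
```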
You need to create a helper Series: compare the column with its shifted values using ne (not-equal) and take the cumulative sum, then pass the result to groupby. To keep the id column, pass both together in a list, drop the first level of the resulting MultiIndex with reset_index(level=0, drop=True), and then convert the index back into an id column:
print (df['id'].ne(df['id'].shift()).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 4
11 5
12 6
13 6
14 6
Name: id, dtype: int32
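Grouping by this helper alone already reproduces the sums from the question (a quick sketch, with the block number as the index):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# New block number whenever the id changes from the previous row:
helper = df['id'].ne(df['id'].shift()).cumsum()
print(df.groupby(helper)['v'].sum())
```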
df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
.reset_index(level=0, drop=True)
.reset_index())
print (df1)
id v
0 a 5.0
1 b 3.0
2 a 2.0
3 b 1.0
4 a 1.0
5 b 3.0
Another idea is to use GroupBy.agg
with a dictionary, aggregating the id
column by GroupBy.first
:
df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
.agg({'id':'first', 'v':'sum'}))
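Run against the example DataFrame, this variant gives the same table as above; since the grouping Series is not a column of the frame, as_index=False here just leaves a default integer index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# Group by the run labels; take the first id and the sum of v per block.
df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id': 'first', 'v': 'sum'}))
print(df1)
```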