Summing up the values in a column of a grouped DataFrame in pandas
Here is my pandas.DataFrame:
a b
0 1 5
1 1 7
2 2 3
3 1 3
4 2 5
5 2 6
6 1 4
7 1 3
8 2 7
9 2 4
10 2 5
I want to create a new DataFrame that groups the data by 'a' and contains the sum of the 3 largest 'b' values for each group.
Here is the output I expect. The 3 largest values of 'b' are 7, 5 and 4 for group 1, and 7, 6 and 5 for group 2.
a
1 16
2 18
df.groupby('a')['b'].nlargest(3)
gives me this output,
a
1  1     7
   0     5
   6     4
2  8     7
   5     6
   10    5
and
df.groupby('a')['b'].nlargest(3).sum()
gives me the total sum, 34 (16 + 18).
How can I get the expected output with pandas.DataFrame?
Thank you!
Using apply is one way to do it.
In [41]: df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
Out[41]:
a
1 16
2 18
Name: b, dtype: int64
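For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frame from the question and applies the same idea. The variable names top3_sum and result, and the column name "top3_sum", are my own choices and not part of the original post; reset_index is only there because the question asks for a DataFrame rather than a Series.

```python
import pandas as pd

# Rebuild the example frame shown in the question.
df = pd.DataFrame({
    "a": [1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2],
    "b": [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5],
})

# For each group of 'a', keep the 3 largest 'b' values and add them up.
top3_sum = df.groupby("a")["b"].apply(lambda s: s.nlargest(3).sum())
print(top3_sum)

# Turn the Series into a two-column DataFrame (column name is illustrative).
result = top3_sum.reset_index(name="top3_sum")
print(result)
```

The lambda runs once per group, so this reads clearly but does Python-level work for every distinct value of 'a'.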
Timings
In [42]: dff = pd.concat([df]*1000).reset_index(drop=True)
In [43]: dff.shape
Out[43]: (11000, 2)
In [44]: %timeit dff.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
100 loops, best of 3: 2.44 ms per loop
In [45]: %timeit dff.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
100 loops, best of 3: 3.44 ms per loop
Use a double groupby, the second one by level a of the MultiIndex:
s = df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
print (s)
a
1 16
2 18
Name: b, dtype: int64
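To see why the second groupby works, it may help to look at the intermediate result. The sketch below uses the frame from the question; the names top3, by_name and by_position are mine. It assumes the grouped nlargest returns a Series with a two-level MultiIndex (group key first, original row label second), which is what the output in the question shows.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2],
    "b": [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5],
})

# Step 1: per-group top 3. Level 0 of the index is the group key 'a',
# level 1 is the original row label (it has no name).
top3 = df.groupby("a")["b"].nlargest(3)
print(list(top3.index.names))

# Step 2: regroup on the first index level and sum.
# The level can be given by name or by position.
by_name = top3.groupby(level="a").sum()
by_position = top3.groupby(level=0).sum()
print(by_name)
```

Grouping by position (level=0) gives the same result as grouping by name here, since 'a' is the first level.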
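A shorter spelling that I find nicer (it relies on the level argument of sum, which was deprecated in pandas 1.3 and removed in 2.0, so on current versions use the groupby(level=...) form above):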
df.groupby('a')['b'].nlargest(3).sum(level=0)
Thank you, Nickil Maveli.
EDIT: If you then need the top 3 groups by that sum, use Series.nlargest again:
df = pd.DataFrame({'a': [1, 1, 2, 3, 2, 2, 1, 3, 4, 3, 4],
'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})
print (df)
a b
0 1 5
1 1 7
2 2 3
3 3 3
4 2 5
5 2 6
6 1 4
7 3 3
8 4 7
9 3 4
10 4 5
df = df.groupby('a')['b'].nlargest(3).sum(level=0).nlargest(3)
print (df)
a
1 16
2 14
4 12
Name: b, dtype: int64
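The same two-stage selection can be sketched without the level argument of sum, for pandas versions where it is no longer available. This is my own restatement using the four-group frame above; the names group_sums and top_groups are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 2, 2, 1, 3, 4, 3, 4],
                   'b': [5, 7, 3, 3, 5, 6, 4, 3, 7, 4, 5]})

# Stage 1: sum of the 3 largest 'b' values within each group of 'a'.
# Group 4 has only two rows, so both of them are used.
group_sums = df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()

# Stage 2: keep the 3 groups with the largest sums.
top_groups = group_sums.nlargest(3)
print(top_groups)
```

Keeping the intermediate group_sums in its own variable also avoids overwriting df with a Series.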
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L2 = np.arange(100)
df = pd.DataFrame({'b':np.random.randint(20, size=N),
'a': np.random.choice(L2, N)})
print (df)
In [22]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 125 ms per loop
In [23]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 121 ms per loop
In [29]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 121 ms per loop
np.random.seed(123)
N = 1000000
L2 = list('abcdefghijklmno')
df = pd.DataFrame({'b':np.random.randint(20, size=N),
'a': np.random.choice(L2, N)})
print (df)
In [19]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 97.9 ms per loop
In [20]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 96.5 ms per loop
In [31]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 97.9 ms per loop
np.random.seed(123)
N = 1000000
L2 = list('abcde')
df = pd.DataFrame({'b':np.random.randint(20, size=N),
'a': np.random.choice(L2, N)})
print (df)
In [25]: %timeit df.groupby('a')['b'].apply(lambda x: x.nlargest(3).sum())
10 loops, best of 3: 82 ms per loop
In [26]: %timeit df.groupby('a')['b'].nlargest(3).groupby(level='a').sum()
10 loops, best of 3: 81.9 ms per loop
In [33]: %timeit df.groupby('a')['b'].nlargest(3).sum(level=0)
10 loops, best of 3: 82.5 ms per loop
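The numbers above come from IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard timeit module. Everything below is an assumption layered on the original benchmark: the helper names with_apply and with_double_groupby are mine, N is reduced to 100,000 rows to keep the run short, and the random data will not match the frames used above. The sum(level=0) variant is left out because it does not run on pandas 2.0 and later.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 100_000  # smaller than the 1,000,000 rows used above
df = pd.DataFrame({
    "a": rng.choice(np.arange(100), N),
    "b": rng.integers(20, size=N),
})


def with_apply(frame):
    # One Python-level lambda call per group.
    return frame.groupby("a")["b"].apply(lambda s: s.nlargest(3).sum())


def with_double_groupby(frame):
    # Grouped nlargest, then regroup on the 'a' index level.
    return frame.groupby("a")["b"].nlargest(3).groupby(level="a").sum()


# Both approaches should agree before their speed is compared.
assert with_apply(df).equals(with_double_groupby(df))

for fn in (with_apply, with_double_groupby):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(df), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per call")
```

Absolute timings depend on the machine and the pandas version, so treat the printed values only as a relative comparison.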