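Using groupby to operate only on rows that have the same value for one of the columns (pandas, Python)
How can I create a groupby operation that acts only on the subset of rows that share the same value in one column?
In the table below, I want to group the rows that share the same doclist value in the "Organization" column, and then add up NP and Pr within each of those groups only.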
Organization NP Pr
0 doclist[0] 0 0
1 doclist[0] 1 0
4 doclist[1] 1 0
5 doclist[4] 1 0
6 doclist[4] 0 1
I was thinking of using .apply() to get the result below. Or is there a better / more efficient way?
Organization NP Pr Sum
0 doclist[0] 0 0 1
1 doclist[0] 1 0 1
4 doclist[1] 1 0 1
5 doclist[4] 1 0 2
6 doclist[4] 0 1 2
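I'd look at groupby, which is the "operate only on rows that have the same value for one of the columns" part. Since you seem to want every row to get the appropriate sum, I think you want to call .transform on it: transform "broadcasts" the grouped values back up to the full DataFrame.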
df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")
For example:
>>> df
Organization NP Pr
0 doclist[0] 0 0
1 doclist[0] 1 0
4 doclist[1] 1 0
5 doclist[4] 1 0
6 doclist[4] 0 1
[5 rows x 3 columns]
>>> df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")
>>> df
Organization NP Pr Sum
0 doclist[0] 0 0 1
1 doclist[0] 1 0 1
4 doclist[1] 1 0 1
5 doclist[4] 1 0 2
6 doclist[4] 0 1 2
[5 rows x 4 columns]
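For anyone who wants to reproduce the session above from scratch, here is a minimal self-contained sketch. The way the frame is constructed (including the explicit index 0, 1, 4, 5, 6 copied from the table) is an assumption for illustration; only the transform line comes from the answer itself.

```python
import pandas as pd

# Rebuild the example table; the index labels are copied from the question's
# output and are an assumption about how the original frame was built.
df = pd.DataFrame(
    {
        "Organization": ["doclist[0]", "doclist[0]", "doclist[1]",
                         "doclist[4]", "doclist[4]"],
        "NP": [0, 1, 1, 1, 0],
        "Pr": [0, 0, 0, 0, 1],
    },
    index=[0, 1, 4, 5, 6],
)

# Add NP and Pr row by row, group that Series by Organization, and let
# transform("sum") write each group's total back onto every row of the group.
df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")

print(df)
```

Because transform returns a result aligned to the original index, the assignment works even though the index labels are not consecutive.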
There is probably a more efficient way (and you could write this more readably), but you can always do something like the following:
import pandas as pd
org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_vals = [0, 1, 1, 1, 0]
pr_vals = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_vals, 'Pr': pr_vals})
# Make a "lookup" Series of the combined NP + Pr sum for each category
# (indexed by the "Organization" value)
sums = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)
# Look up the result based on the contents of the "Organization" column in each row
# (.loc is used here because .ix was removed in pandas 1.0)
df['Sum'] = df.apply(lambda row: sums.loc[row['Organization']], axis=1)
This is fairly hard to read, so it may be a bit clearer written this way:
import pandas as pd
org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_vals = [0, 1, 1, 1, 0]
pr_vals = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_vals, 'Pr': pr_vals})
# Make a "lookup" Series of the combined NP + Pr sum for each category
lookup = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)
# Look up the result based on the contents of the "Organization" column in each row
# The "lookup" Series is indexed by "Organization" and holds the relevant sum
def func(row):
    org = row['Organization']
    group_sum = lookup.loc[org]  # .loc replaces .ix, which was removed in pandas 1.0
    return group_sum
df['Sum'] = df.apply(func, axis=1)
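The per-row apply in the lookup above can also be expressed with Series.map, which avoids calling a Python function once per row. This is only a sketch of that variation, not part of the original answer; the name group_sums is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Organization": ["doclist[0]", "doclist[0]", "doclist[1]",
                     "doclist[4]", "doclist[4]"],
    "NP": [0, 1, 1, 1, 0],
    "Pr": [0, 0, 0, 0, 1],
})

# Per-group column totals, then NP + Pr added together: a Series indexed
# by the Organization value (group_sums is a hypothetical name).
group_sums = df.groupby("Organization")[["NP", "Pr"]].sum().sum(axis=1)

# map() looks up each row's Organization value in the index of group_sums.
df["Sum"] = df["Organization"].map(group_sums)

print(df)
```

The idea is the same as the lookup approach: compute one total per group, then fan it back out to the rows by key.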
As a side note, @DSM's answer looks like the better approach.