Using groupby to operate only on rows that have the same value in one of the columns (pandas, Python)

Question

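How do I write a groupby operation that is applied only to subsets of rows sharing the same value in a column?

So, in the table below, I want to subset the rows that share the same doclist value in the "Organization" column, and then add up NP and Pr only within each of those subsets.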

      Organization  NP  Pr
    0   doclist[0]   0   0
    1   doclist[0]   1   0
    4   doclist[1]   1   0
    5   doclist[4]   1   0
    6   doclist[4]   0   1

I'm thinking of using .apply() to get the result below, or is there a better/more efficient way?

      Organization  NP  Pr  Sum
    0   doclist[0]   0   0    1
    1   doclist[0]   1   0    1
    4   doclist[1]   1   0    1
    5   doclist[4]   1   0    2
    6   doclist[4]   0   1    2

Answer 1

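I'd look at groupby, which handles the "operate only on rows that have the same value in one of the columns" part. Since you seem to want every row to get the appropriate sum, I think you want to call .transform on it. transform "broadcasts" the grouped values back up to the full DataFrame.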

df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")

For example:

>>> df
  Organization  NP  Pr
0   doclist[0]   0   0
1   doclist[0]   1   0
4   doclist[1]   1   0
5   doclist[4]   1   0
6   doclist[4]   0   1

[5 rows x 3 columns]
>>> df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")
>>> df
  Organization  NP  Pr  Sum
0   doclist[0]   0   0    1
1   doclist[0]   1   0    1
4   doclist[1]   1   0    1
5   doclist[4]   1   0    2
6   doclist[4]   0   1    2

[5 rows x 4 columns]
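
To make the "broadcast" behaviour concrete, here is a minimal sketch contrasting a plain group aggregation with .transform on the same data. The names row_totals, grouped, per_group and per_row are illustrative only and do not appear in the answer above.

```python
import pandas as pd

# Rebuild the question's table, including its non-contiguous index
df = pd.DataFrame(
    {
        "Organization": ["doclist[0]", "doclist[0]", "doclist[1]",
                         "doclist[4]", "doclist[4]"],
        "NP": [0, 1, 1, 1, 0],
        "Pr": [0, 0, 0, 0, 1],
    },
    index=[0, 1, 4, 5, 6],
)

# Add NP and Pr row by row, then group those totals by Organization
row_totals = df["NP"] + df["Pr"]
grouped = row_totals.groupby(df["Organization"])

# .sum() collapses each group to a single value (one entry per Organization)
per_group = grouped.sum()

# .transform("sum") returns one value per original row, aligned to df's index
per_row = grouped.transform("sum")

df["Sum"] = per_row
print(per_group)
print(df)
```

Because the transformed result keeps the original index, it can be assigned straight back as a new column without any merge or lookup step.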

Answer 2

There may be a more efficient way (and you could certainly write this more readably), but you can always do something like this:

import pandas as pd

org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_values = [0, 1, 1, 1, 0]
pr_values = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_values, 'Pr': pr_values})

# Make a "lookup" Series of the combined NP + Pr sum for each category
# (sum each column per group, then add the two column sums together)
sums = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)

# Look up the result based on the contents of each row's "Organization" value
df['Sum'] = df.apply(lambda row: sums.loc[row['Organization']], axis=1)

That is fairly hard to read, so it may be clearer written this way:

import pandas as pd

org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_values = [0, 1, 1, 1, 0]
pr_values = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_values, 'Pr': pr_values})

# Make a "lookup" Series of the combined NP + Pr sum for each category
lookup = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)

# Look up the result based on the contents of each row's "Organization" value
# "lookup" is indexed by "Organization" and holds one combined sum per group
def func(row):
    organization = row['Organization']
    group_sum = lookup.loc[organization]
    return group_sum
df['Sum'] = df.apply(func, axis=1)

As an aside, @DSM's answer looks like the better approach.
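
If the lookup idea is kept, the row-wise apply can probably be avoided as well. The following is a sketch only, assuming the same example data; group_sums is an illustrative name, and Series.map is swapped in here for the apply call used above.

```python
import pandas as pd

df = pd.DataFrame({
    "Organization": ["doclist[0]", "doclist[0]", "doclist[1]",
                     "doclist[4]", "doclist[4]"],
    "NP": [0, 1, 1, 1, 0],
    "Pr": [0, 0, 0, 0, 1],
})

# One combined NP + Pr total per Organization, indexed by Organization
group_sums = df.groupby("Organization")[["NP", "Pr"]].sum().sum(axis=1)

# Map each row's Organization value to its group's total in one vectorized step
df["Sum"] = df["Organization"].map(group_sums)
print(df)
```

This does the same lookup as the apply version, but lets pandas match the labels for the whole column at once instead of calling a Python function per row.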

2 solutions

Solution 1
4 2014-03-12 19:59:32

Solution 2
2 accepted 2014-03-12 19:55:42
