
Add new column based on sum of a column and grouped by 2 other columns in Pandas

I have the dataframe:

import pandas as pd

df = pd.DataFrame({'Continent':['North America','North America','North America','Europe','Europe','Europe','Europe'],
                'Country': ['US','Canada','Mexico','France','Germany','Spain','Italy'],
                'Status': ['Member','Non-Member','Non-Member','Member','Non-Member','Member','Non-Member'],
                'Units': [27,5,4,10,15,8,8]})

print(df)

       Continent  Country      Status  Units
0  North America       US      Member     27
1  North America   Canada  Non-Member      5
2  North America   Mexico  Non-Member      4
3         Europe   France      Member     10
4         Europe  Germany  Non-Member     15
5         Europe    Spain      Member      8
6         Europe    Italy  Non-Member      8

I need to add two columns of summary statistics per continent: one with the sum of Units for Member countries and one with the sum of Units for Non-Member countries.

The final output should look like this:

       Continent  Member Units  Non-Member Units  Country      Status  Units
0  North America            27                 9       US      Member     27
1  North America            27                 9   Canada  Non-Member      5
2  North America            27                 9   Mexico  Non-Member      4
3         Europe            18                23   France      Member     10
4         Europe            18                23  Germany  Non-Member     15
5         Europe            18                23    Spain      Member      8
6         Europe            18                23    Italy  Non-Member      8

It seems like I need to use groupby, but I can't figure out how to take the groupby values and re-insert them into the dataframe as new columns.

summary_stats = df.groupby(['Continent','Status'])['Units'].sum()
print(summary_stats)

Continent      Status    
Europe         Member        18
               Non-Member    23
North America  Member        27
               Non-Member     9
Name: Units, dtype: int64

I also tried it without groupby:

df['Member Units'] = df['Units'][df['Status'] == 'Member'].sum()
df['Non-Member Units'] = df['Units'][df['Status'] == 'Non-Member'].sum()

but that doesn't differentiate by Continent, so it just adds up all the Members and all the Non-Members.

Any help is greatly appreciated!

I think you first need groupby with transform and sum to create a new Series, all_sum. Then I think it is better to use numpy.where: if the row is a member, take the value from the Series; if not, use 0. Non-members are handled similarly:

all_sum = df.groupby(['Continent','Status'])['Units'].transform('sum')
print(all_sum)
0    27
1     9
2     9
3    18
4    23
5    18
6    23
dtype: int64

import numpy as np

df['Member Units'] = np.where(df['Status'] == 'Member', all_sum, 0)
df['Non-Member Units'] = np.where(df['Status'] != 'Member', all_sum, 0)
print(df)
       Continent  Country      Status  Units  Member Units  Non-Member Units
0  North America       US      Member     27            27                 0
1  North America   Canada  Non-Member      5             0                 9
2  North America   Mexico  Non-Member      4             0                 9
3         Europe   France      Member     10            18                 0
4         Europe  Germany  Non-Member     15             0                23
5         Europe    Spain      Member      8            18                 0
6         Europe    Italy  Non-Member      8             0                23
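
The output above puts 0 in the column that does not match the row's Status, whereas the desired output in the question shows both totals on every row. One possible variation, given here only as a minimal sketch and not as part of the answer above, is to zero out the units that do not belong to a status and then sum per continent. The helper names is_member, member_units and non_member_units are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member', 'Member',
               'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

# Boolean mask (hypothetical helper name): True for Member rows
is_member = df['Status'] == 'Member'

# Keep Units only where the row has the wanted status, 0 elsewhere
member_units = df['Units'].where(is_member, 0)
non_member_units = df['Units'].where(~is_member, 0)

# Sum per continent and broadcast the total back to every row
df['Member Units'] = member_units.groupby(df['Continent']).transform('sum')
df['Non-Member Units'] = non_member_units.groupby(df['Continent']).transform('sum')

print(df)
```

Because transform returns a result aligned with the original index, the new columns can be assigned directly without a merge.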

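Once you have summary_stats, I think you can do something like:

df['Member Units'] = summary_stats.loc[list(zip(df['Continent'], df['Status']))].to_numpy()

The reason for zipping the Series values is that summary_stats is indexed by (Continent, Status) pairs, so each row needs a plain tuple as its lookup key. In Python 3, zip returns an iterator, so it has to be wrapped in list() before indexing. The result of the lookup carries the MultiIndex of summary_stats rather than the index of df, which you don't want, so it is converted to a plain array before assignment.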
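
A related way to reuse summary_stats, sketched here under the assumption that every continent has at least one Member row and one Non-Member row, is to slice out one Status level at a time and map each row's Continent onto the resulting totals. The names member_totals and non_member_totals are illustrative only; a continent missing from either slice would end up with NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member', 'Member',
               'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()

# Slice out one Status level; each result is indexed by Continent only
member_totals = summary_stats.xs('Member', level='Status')
non_member_totals = summary_stats.xs('Non-Member', level='Status')

# Look up each row's continent in the per-continent totals
df['Member Units'] = df['Continent'].map(member_totals)
df['Non-Member Units'] = df['Continent'].map(non_member_totals)

print(df)
```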

Since you already have summary_stats, you can reshape it and then use merge():

summary = summary_stats.reset_index().pivot(index='Continent', columns='Status', values='Units')

summary = summary.reset_index()  # make Continent a regular column so it is an unambiguous merge key

df = df.merge(summary, on='Continent')

Then rename the columns as needed.
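
For completeness, the reshape, rename and merge steps can be put together as below. This is only a sketch: it uses unstack in place of reset_index plus pivot (both give one column per Status), and the variable name result and the choice of how='left' are assumptions made here to keep the original row order.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member', 'Member',
               'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()

# One row per continent, one column per Status, then rename and
# turn Continent back into a regular column for the merge
summary = (
    summary_stats.unstack('Status')
    .rename(columns={'Member': 'Member Units',
                     'Non-Member': 'Non-Member Units'})
    .reset_index()
)

# A left merge keeps every row of df in its original order
result = df.merge(summary, on='Continent', how='left')

print(result)
```

If the column order from the question matters, result can be reindexed afterwards with a list of column names in the wanted order.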
