Add new column based on sum of a column and grouped by 2 other columns in Pandas
I have a DataFrame:
import pandas as pd

df = pd.DataFrame({'Continent': ['North America', 'North America', 'North America', 'Europe', 'Europe', 'Europe', 'Europe'],
                   'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
                   'Status': ['Member', 'Non-Member', 'Non-Member', 'Member', 'Non-Member', 'Member', 'Non-Member'],
                   'Units': [27, 5, 4, 10, 15, 8, 8]})
print(df)
       Continent  Country      Status  Units
0  North America       US      Member     27
1  North America   Canada  Non-Member      5
2  North America   Mexico  Non-Member      4
3         Europe   France      Member     10
4         Europe  Germany  Non-Member     15
5         Europe    Spain      Member      8
6         Europe    Italy  Non-Member      8
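I need to add 2 columns that hold summary statistics for each continent: one column with the sum of Units for member countries and one with the sum of Units for non-member countries.
So the final output should look like this: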
       Continent  Member Units  Non-Member Units  Country      Status  Units
0  North America            27                 9       US      Member     27
1  North America            27                 9   Canada  Non-Member      5
2  North America            27                 9   Mexico  Non-Member      4
3         Europe            18                23   France      Member     10
4         Europe            18                23  Germany  Non-Member     15
5         Europe            18                23    Spain      Member      8
6         Europe            18                23    Italy  Non-Member      8
It looks like I need to use groupby, but I can't figure out how to take the groupby values and insert them back into the DataFrame as new columns.
summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()
print(summary_stats)
Continent      Status
Europe         Member        18
               Non-Member    23
North America  Member        27
               Non-Member     9
Name: Units, dtype: int64
I also tried it without groupby:
df['Member Units'] = df['Units'][df['Status'] == 'Member'].sum()
df['Non-Member Units'] = df['Units'][df['Status'] == 'Non-Member'].sum()
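But this does not distinguish between continents, so it simply adds up all members and all non-members across the whole DataFrame.
Any help would be much appreciated!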
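I think you first need groupby with transform and sum to create a new Series, all_sum, holding the total for each (Continent, Status) group on every row. Then I think it is best to use numpy.where: if the row is a Member, take the value from the Series, otherwise use 0. The Non-Member column works the same way: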
all_sum = df.groupby(['Continent', 'Status'])['Units'].transform('sum')
print(all_sum)
0 27
1 9
2 9
3 18
4 23
5 18
6 23
dtype: int64
import numpy as np

# Keep the group total only on rows whose Status matches; fill the rest with 0
df['Member Units'] = np.where(df['Status'] == 'Member', all_sum, 0)
df['Non-Member Units'] = np.where(df['Status'] != 'Member', all_sum, 0)
print(df)
       Continent  Country      Status  Units  Member Units  Non-Member Units
0  North America       US      Member     27            27                 0
1  North America   Canada  Non-Member      5             0                 9
2  North America   Mexico  Non-Member      4             0                 9
3         Europe   France      Member     10            18                 0
4         Europe  Germany  Non-Member     15             0                23
5         Europe    Spain      Member      8            18                 0
6         Europe    Italy  Non-Member      8             0                23
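The output above puts 0 on rows whose Status does not match, whereas the question asks for both continent totals on every row. A minimal sketch of one way to get that, assuming the same sample data: zero out the Units of the other status with Series.where, then sum per continent with transform. The loop and the name masked are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

for status in ['Member', 'Non-Member']:
    # Keep Units only where the row has this status; other rows count as 0
    masked = df['Units'].where(df['Status'] == status, 0)
    # Sum the masked values per continent and broadcast back to every row
    df[status + ' Units'] = masked.groupby(df['Continent']).transform('sum')

print(df)
```

Because transform returns a result aligned to the original index, the row order of df is untouched and no merge is needed.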
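Once you have summary_stats, I think you can do this:
df['Member Units'] = summary_stats.loc[list(zip(df['Continent'].values, df['Status'].values))].values
The lookup keys are built by zipping the raw values because df['Continent'] on its own returns a Series with an index, and you don't want that index involved here. In Python 3, zip has to be wrapped in list(), and the trailing .values drops the MultiIndex of the result so the assignment is positional. Note that each row receives the total for its own (Continent, Status) pair.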
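A minimal sketch of the same per-row lookup, assuming the sample data from the question. It swaps in Series.reindex with a MultiIndex built from the two key columns in place of the zip-based indexing; the column name 'Group Units' and the variable keys are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary_stats = df.groupby(['Continent', 'Status'])['Units'].sum()

# One (Continent, Status) key per row of df, in row order
keys = pd.MultiIndex.from_frame(df[['Continent', 'Status']])

# Look up each row's group total; to_numpy() discards the MultiIndex so the
# assignment is positional rather than index-aligned
df['Group Units'] = summary_stats.reindex(keys).to_numpy()

print(df)
```

This yields the same values as groupby(...).transform('sum'), so it is mainly useful when summary_stats has already been computed for another purpose.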
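Since you already have summary_stats, you can reshape it and then use merge():
summary = summary_stats.reset_index().pivot(index='Continent', columns='Status', values='Units')
summary = summary.reset_index()  # turn the Continent index back into a regular column for the merge
df = df.merge(summary, on='Continent')
Then just rename the columns as needed.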
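An end-to-end sketch of the merge approach, assuming the sample data from the question. It uses unstack as a shorter stand-in for reset_index().pivot(), and add_suffix to produce the 'Member Units' and 'Non-Member Units' names; the variable result is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Continent': ['North America', 'North America', 'North America',
                  'Europe', 'Europe', 'Europe', 'Europe'],
    'Country': ['US', 'Canada', 'Mexico', 'France', 'Germany', 'Spain', 'Italy'],
    'Status': ['Member', 'Non-Member', 'Non-Member',
               'Member', 'Non-Member', 'Member', 'Non-Member'],
    'Units': [27, 5, 4, 10, 15, 8, 8],
})

summary = (
    df.groupby(['Continent', 'Status'])['Units'].sum()
      .unstack('Status')            # one row per continent, one column per status
      .add_suffix(' Units')         # 'Member' -> 'Member Units', etc.
      .rename_axis(columns=None)    # drop the leftover 'Status' columns label
      .reset_index()                # make Continent a regular column
)

# A left merge keeps every row of df in its original order
result = df.merge(summary, on='Continent', how='left')
print(result)
```

The new columns land at the right-hand side of the frame; reorder with result[[...]] if the layout from the question is required.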