Python Pandas: replace groupby operation

I have the following table as a pandas DataFrame:

| ID | Name | Sales | Source   |
|----|------|-------|----------|
| 1  | a    | 34    | Source A |
| 2  | b    | 3423  | Source A |
| 3  | c    | 2     | Source A |
| 4  | d    | 342   | Source A |
| 3  | c    | 34    | Source A |
| 5  | e    | 234   | Source A |
| 6  | f    | 234   | Source A |
| 7  | g    | 23    | Source A |
| 1  | a    | 12    | Source B |
| 2  | b    | 42    | Source B |
| 3  | c    | 9     | Source B |
| 2  | b    | 22    | Source B |
| 1  | a    | 1     | Source B |
| 8  | h    | 56    | Source B |

What is the best way to (i) aggregate sales for each ID within each source and (ii) put the result in two new columns, "Source A" and "Source B", so that the resulting DataFrame looks as follows:

| ID | Name | Source A | Source B |
|----|------|----------|----------|
| 1  | a    | 34       | 13       |
| 2  | b    | 3423     | 64       |
| 3  | c    | 36       | 9        |
| 4  | d    | 342      | 0        |
| 5  | e    | 234      | 0        |
| 6  | f    | 234      | 0        |
| 7  | g    | 23       | 0        |
| 8  | h    | 0        | 56       |

My initial approach was as follows:

import pandas as pd

data = {"ID":[1,2,3,4,3,5,6,7,1,2,3,2,1,8],
      "Name":list("abcdcefgabcbah"), 
      "Sales":[34,3423,2,342,34,234,234,23,12,42,9,22,1,56],
      "Source":["Source A"]*8 + ["Source B"]*6
     }
df = pd.DataFrame(data)

df.groupby(["ID","Name","Source"])["Sales"].sum().unstack()

Question: my initial table is built from different files and then combined with pd.concat. So it feels like I could achieve the final table by concatenating (or merging) differently in the first place. Is there a better approach to achieve this? As a side note: the actual data table consists of 6 different sources.

Thanks for your help!

You can use pd.crosstab:

pd.crosstab(df.Name, df.Source, df.Sales, aggfunc='sum').fillna(0)

Output:

Source  Source A  Source B
Name                      
a           34.0      13.0
b         3423.0      64.0
c           36.0       9.0
d          342.0       0.0
e          234.0       0.0
f          234.0       0.0
g           23.0       0.0
h            0.0      56.0

Or, pivot_table:

df.pivot_table('Sales','Name','Source', aggfunc='sum').fillna(0)

Output:

Source  Source A  Source B
Name                      
a           34.0      13.0
b         3423.0      64.0
c           36.0       9.0
d          342.0       0.0
e          234.0       0.0
f          234.0       0.0
g           23.0       0.0
h            0.0      56.0
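The calls above index by Name only, while the table in the question also carries ID. A minimal sketch, assuming the same sample data as in the question, that keeps both keys and passes fill_value to pivot_table instead of chaining fillna. The variable name wide is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
data = {"ID": [1, 2, 3, 4, 3, 5, 6, 7, 1, 2, 3, 2, 1, 8],
        "Name": list("abcdcefgabcbah"),
        "Sales": [34, 3423, 2, 342, 34, 234, 234, 23, 12, 42, 9, 22, 1, 56],
        "Source": ["Source A"] * 8 + ["Source B"] * 6}
df = pd.DataFrame(data)

wide = (
    df.pivot_table(index=["ID", "Name"],   # keep both identifying columns
                   columns="Source",       # one output column per source
                   values="Sales",
                   aggfunc="sum",
                   fill_value=0)           # missing ID/source pairs become 0
      .reset_index()                       # turn ID and Name back into columns
      .rename_axis(columns=None)           # drop the leftover "Source" axis label
)
print(wide)
```

Because the columns come from the values of Source, the same call should carry over unchanged to the six-source case mentioned in the question.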

Or use set_index, sum over the index levels, then unstack. Series.sum(level=...) was removed in pandas 2.0, so the grouping is written with groupby(level=...):

df.set_index(['Name','Source'])['Sales'].groupby(level=[0,1]).sum().unstack(fill_value=0)

Output:

Source  Source A  Source B
Name                      
a             34        13
b           3423        64
c             36         9
d            342         0
e            234         0
f            234         0
g             23         0
h              0        56

Try the following code:

df.groupby(['Name', 'Source'])['Sales'].sum()\
    .unstack(1).fillna(0).reset_index()
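On the question of combining the files differently in the first place: one possible sketch, assuming each file has already been read into its own DataFrame, is to aggregate each source separately and concatenate the results side by side. The names source_a, source_b, frames and wide are hypothetical stand-ins, and the frames are built inline here instead of being read from files.

```python
import pandas as pd

# Hypothetical per-source frames standing in for the separate files
source_a = pd.DataFrame({"ID": [1, 2, 3, 4, 3, 5, 6, 7],
                         "Name": list("abcdcefg"),
                         "Sales": [34, 3423, 2, 342, 34, 234, 234, 23]})
source_b = pd.DataFrame({"ID": [1, 2, 3, 2, 1, 8],
                         "Name": list("abcbah"),
                         "Sales": [12, 42, 9, 22, 1, 56]})
frames = {"Source A": source_a, "Source B": source_b}

# Sum sales per (ID, Name) within each source
totals = {label: frame.groupby(["ID", "Name"])["Sales"].sum()
          for label, frame in frames.items()}

# Align the per-source totals on (ID, Name); the dict keys become column names
wide = (
    pd.concat(totals, axis=1)
      .fillna(0)        # IDs absent from a source get 0
      .astype(int)
      .sort_index()
      .reset_index()
)
print(wide)
```

This skips the long-format intermediate table, and adding more sources only means adding entries to the dict. Whether it is actually better than concatenating and pivoting likely depends on how the files are loaded.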
