简体   繁体   中英

Python pandas - Aggregate values in separate columns

I am new in Python and I was wondering if somebody can help me with the below task.

I am having the below dataframe df with the following columns:

  • Primary ID
  • Secondary ID
  • Entity type
  • Value

Each primary entity (Entity type: A) might consist of some secondary entities (Entity types; X or Y). For entity types A, the primary id is the same with the secondary id. Also, each primary entity and each secondary entity have a value.

In columns 'Sum of values Secondary id X' and 'Sum of values Secondary id Y', I want to have the aggregate value of the secondary entities (X and Y) which correspond to each primary entity. The aggregate values should be in the row of the primary entity.

So, my initial df is this:

Primary ID Secondary ID Entity type Value
0109 0109 A 200
0109 A234 X 100
0109 A234 X 50
9996 9996 A 400
9996 AAGT X 120
9996 AABG X 30
9996 0082 Y 50
A765 A765 A 50

And I just want to add the 2 columns, without changing the format of the initial dataframe:

Primary ID Secondary ID Entity type Value Sum of values Secondary id X Sum of values Secondary id Y
0109 0109 A 200 150 0
0109 A234 X 100 0 0
0109 A234 X 50 0 0
9996 9996 A 400 150 50
9996 AAGT X 120 0 0
9996 AABG X 30 0 0
9996 0082 Y 50 0 0
A765 A765 A 50 0 0

Thank you!

try creating dict using:

(df.groupby(['Primary ID', 'Entity type'])['Value']      
        .sum()    
        .unstack(-1)                    
        .fillna(0).reset_index()                      
        ) 

Entity type Primary ID  A   X       Y
0           0109    200.0   150.0   0.0
1           9996    400.0   150.0   50.0
2           A765    50.0    0.0     0.0

The above data can be converted into a dict and then you can map.

Complete solution:

map_df = (df.groupby(['Primary ID', 'Entity type'])['Value']      
        .sum()    
        .unstack(-1)                    
        .fillna(0).reset_index()                      
        ).drop(['A'], axis=1).set_index('Primary ID')

df.set_index('Primary ID', inplace=True)
df.loc[df['Entity type'].eq('A'),['new_x', 'new_y']] = map_df.values

df:

Secondary ID Entity type Value new_x new_y
Primary ID
0109 0109 A 200 150.0 0.0
0109 A234 X 100 0.0 0.0
0109 A234 X 50 0.0 0.0
9996 9996 A 400 150.0 50.0
9996 AAGT X 120 0.0 0.0
9996 AABG X 30 0.0 0.0
9996 0082 Y 50 0.0 0.0
A765 A765 A 50 0.0 0.0

First solution is used first by replace Value s by missing values by Series.where and then for new columns is used GroupBy.transform , last replace duplicates by 0 in DataFrame.mask :

cols = ['Sum of values Secondary id X','Sum of values Secondary id Y']
df[cols] = (df.assign(x = df['Value'].where(df['Entity type'].eq('X')),
                      y = df['Value'].where(df['Entity type'].eq('Y')))
              .groupby('Primary ID')[['x','y']]
              .transform('sum')
              .mask(df['Primary ID'].duplicated(), 0)
              .astype(int))
print (df)
  Primary ID Secondary ID Entity type  Value  Sum of values Secondary id X  \
0       0109         0109           A    200                           150   
1       0109         A234           X    100                             0   
2       0109         A234           X     50                             0   
3       9996         9996           A    400                           150   
4       9996         AAGT           X    120                             0   
5       9996         AABG           X     30                             0   
6       9996         0082           Y     50                             0   
7       A765         A765           A     50                             0   

   Sum of values Secondary id Y  
0                             0  
1                             0  
2                             0  
3                            50  
4                             0  
5                             0  
6                             0  
7                             0 

If need all values to new columns use DataFrame.pivot_table with DataFrame.join :

df1 = (df.pivot_table(index='Primary ID', 
                    columns='Entity type', 
                    values='Value', 
                    aggfunc='sum', fill_value=0)
         .add_prefix('Sum of values Secondary id '))


df = df.join(df1, on='Primary ID')
df.loc[df['Primary ID'].duplicated(), df1.columns] = 0
print (df)
  Primary ID Secondary ID Entity type  Value  Sum of values Secondary id A  \
0       0109         0109           A    200                           200   
1       0109         A234           X    100                             0   
2       0109         A234           X     50                             0   
3       9996         9996           A    400                           400   
4       9996         AAGT           X    120                             0   
5       9996         AABG           X     30                             0   
6       9996         0082           Y     50                             0   
7       A765         A765           A     50                            50   

   Sum of values Secondary id X  Sum of values Secondary id Y  
0                           150                             0  
1                             0                             0  
2                             0                             0  
3                           150                            50  
4                             0                             0  
5                             0                             0  
6                             0                             0  
7                             0                             0  

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM