pandas groupby sum() one column keep the rest

I have the following DataFrame:

import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration': ['30days','50days','55days','40days','60days','35days','55days','50days'],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)

How can I group by 'Courses' and sum only the Fee column, but keep all the rest of the columns? When doing:

df2 = df.groupby('Courses').sum()

Both Fee and Discount are summed, and Duration is dropped from the resulting DataFrame.

When doing:

df.groupby('Courses')['Fee'].sum()

I end up with only the Fee per Course.

Expected output:

          Fee Duration  Discount
Courses                          
Hadoop   48000  90days      2300
Pandas   26000  60days      2500
PySpark  25000  50days      2300
Python   46000  90days      2800
Spark    47000  85days      2400

Edit: My original DataFrame has more columns.

Is there a good approach for this?

You can convert the Duration column to timedeltas with to_timedelta and then aggregate with sum, passing numeric_only=False:

df['Duration'] = pd.to_timedelta(df['Duration'])
df2 = df.groupby('Courses').sum(numeric_only=False)
print (df2)
           Fee Duration  Discount
Courses                          
Hadoop   48000  90 days      2300
Pandas   26000  60 days      2500
PySpark  25000  50 days      2300
Python   46000  90 days      2800
Spark    47000  85 days      2400
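
If you need Duration back in the original '30days'-style strings to match the expected output, a small follow-up sketch (my addition, assuming the summed timedeltas are whole days) converts them via the .dt.days accessor:

df2['Duration'] = df2['Duration'].dt.days.astype(str) + 'days'  # e.g. Timedelta('90 days') -> '90days'
print(df2)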

EDIT: You can specify the columns to aggregate with sum in a dictionary, and aggregate all the other columns with first:

import numpy as np

technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
    'Fee': [22000,25000,23000,24000,26000,25000,25000,22000],
    'Duration': ['30days','50days','55days','40days','60days','35days','55days','50days'],
    'Discount': [1000,2300,1000,1200,2500,1300,1400,1600],
    'col': 'a' + pd.Series(np.arange(8).astype(str))
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount','col'])
print(df)
   Courses    Fee Duration  Discount col
0    Spark  22000   30days      1000  a0
1  PySpark  25000   50days      2300  a1
2   Hadoop  23000   55days      1000  a2
3   Python  24000   40days      1200  a3
4   Pandas  26000   60days      2500  a4
5   Hadoop  25000   35days      1300  a5
6    Spark  25000   55days      1400  a6
7   Python  22000   50days      1600  a7

cols = ['Fee','Duration','Discount']
df['Duration'] = pd.to_timedelta(df['Duration'])
d = dict.fromkeys(cols, 'sum')
# only the columns NOT in cols get 'first', so the 'sum' entries are not overwritten in the merge
another = dict.fromkeys(df.columns.drop(cols + ['Courses']), 'first')
df2 = df.groupby('Courses').agg({**d, **another})
print(df2)
           Fee Duration  Discount col
Courses                              
Hadoop   48000  90 days      2300  a2
Pandas   26000  60 days      2500  a4
PySpark  25000  50 days      2300  a1
Python   46000  90 days      2800  a3
Spark    47000  85 days      2400  a0
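
An equivalent spelling (my addition, not part of the original answer) uses pandas named aggregation, available since pandas 0.25; it is more explicit but hard-codes the column names, so the dictionary version above scales better when the real DataFrame has many columns:

df2 = df.groupby('Courses').agg(
    Fee=('Fee', 'sum'),
    Duration=('Duration', 'sum'),
    Discount=('Discount', 'sum'),
    col=('col', 'first'),
)
print(df2)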

Another approach is to use GroupBy.transform for the columns in the list:

cols = ['Fee','Duration','Discount']
df['Duration'] = pd.to_timedelta(df['Duration'])

df[cols] = df.groupby('Courses')[cols].transform('sum')
print (df)
   Courses    Fee Duration Discount col
0    Spark  47000  85 days     2400  a0
1  PySpark  25000  50 days     2300  a1
2   Hadoop  48000  90 days     2300  a2
3   Python  46000  90 days     2800  a3
4   Pandas  26000  60 days     2500  a4
5   Hadoop  48000  90 days     2300  a5
6    Spark  47000  85 days     2400  a6
7   Python  46000  90 days     2800  a7

Then keep the first row per Courses with DataFrame.drop_duplicates:

df2 = df.drop_duplicates('Courses')
print (df2)
   Courses    Fee Duration Discount col
0    Spark  47000  85 days     2400  a0
1  PySpark  25000  50 days     2300  a1
2   Hadoop  48000  90 days     2300  a2
3   Python  46000  90 days     2800  a3
4   Pandas  26000  60 days     2500  a4
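
If you also want Courses as the index, sorted alphabetically as in the expected output, a small optional step (my addition) is:

df2 = df.drop_duplicates('Courses').set_index('Courses').sort_index()
print(df2)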
