简体   繁体   中英

Create multiple columns with calculations using groupby, aggregate functions in Pandas

import pandas as pd
df = pd.DataFrame({'zip,company': ["46062|A","11236|B","11236|C","11236|C","11236|C","11236|A","11236|A","11236|A","11236|B","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A"], 
                   'goodbadscore': ["good","bad","bad","good","good","bad","bad","good","good","good","bad","good","good","good","good","bad","bad","good"],
                   'postlcode' : ["46062","11236","11236","11236","11236","46062","11236","46062","11236","11236","11236","11236","11236","11236","11236","11236","11236","11236"],
                   'companyname': ["A","B","C","C","C","A","A","A","B","B","A","A","B","A","A","B","A","A"]}
                   )
                   
print(df)

-----updated a sample data frame above as suggestion-----

I tried to produce the result in Excel, but using countif and countifs break my desktop and even it's fine, it takes several minutes to complete the task. hope can get some help and directions.

here is what i try to achieve:

I want to score company's' reputation in several zip codes based on the collected data. columns needed to produce:

  1. countinzipcode
  2. countgoodscoreinzip
  3. dividegoodscore%(2/1)
  4. ranking
  • I was able to produce 1 :

    op = df.groupby(['zip+company'])['zip+company'].count()

  • have difficulty on 2 : want to keep the output from 1, but it becomes 0 after apply. only want to show good for column 2

    op = op.groupby(['zip+company'])[['zip+company','countgoodscoreunderzip']].apply(lambda x: x[x=='good'].count())

  • then 3 , I guess it's a matter of selecting 2 and divided by 1

  • 4 no idea yet how to rank in pandas, which could be a simple ranking

The pic of excel is the ideal output( updated with a sample data frame ).

Thanks for the reading.

在此处输入图片说明

Named aggregation should help the first two columns:

op = df.groupby('zip,company', as_index=False).aggregate(
    countinzipcode=('zip,company', 'count'),
    goodscoreinzip=('goodbadscore', lambda s: s.eq('good').sum())
)

op :

  zip,company  countinzipcode  goodscoreinzip
0     11236|A               7               4
1     11236|B               5               3
2     11236|C               3               2
3     46062|A               3               2

Simple math operations can be used to get the percentage for 3:

op['goodscore%'] = op['goodscoreinzip'] / op['countinzipcode'] * 100
  zip,company  countinzipcode  goodscoreinzip  goodscore%
0     11236|A               7               4   57.142857
1     11236|B               5               3   60.000000
2     11236|C               3               2   66.666667
3     46062|A               3               2   66.666667

Then rank can be used to get the ranking for 4:

op['ranking'] = op['goodscore%'].rank(ascending=False, method='dense').astype(int)

op :

  zip,company  countinzipcode  goodscoreinzip  goodscore%  ranking
0     11236|A               7               4   57.142857        3
1     11236|B               5               3   60.000000        2
2     11236|C               3               2   66.666667        1
3     46062|A               3               2   66.666667        1

Sample Data Used (Based on the numbers in the image not the code constructor):

df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
    'postlcode': ["46062", "11236", "11236", "11236", "11236",
                  "46062", "11236", "46062", "11236", "11236",
                  "11236", "11236", "11236", "11236", "11236",
                  "11236", "11236", "11236"],
    'companyname': ["A", "B", "C", "C", "C", "A", "A", "A", "B",
                    "B", "A", "A", "B", "A", "A", "B", "A", "A"]
})
   zip,company goodbadscore postlcode companyname
0      46062|A         good     46062           A
1      11236|B          bad     11236           B
2      11236|C          bad     11236           C
3      11236|C         good     11236           C
4      11236|C         good     11236           C
5      11236|A          bad     46062           A
6      11236|A          bad     11236           A
7      11236|A         good     46062           A
8      11236|B         good     11236           B
9      11236|B         good     11236           B
10     11236|A          bad     11236           A
11     11236|A         good     11236           A
12     11236|B         good     11236           B
13     11236|A         good     11236           A
14     11236|A         good     11236           A
15     11236|B          bad     11236           B
16     46062|A          bad     11236           A
17     46062|A         good     11236           A

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM