Create multiple columns with calculations using groupby, aggregate functions in Pandas

Question

import pandas as pd
df = pd.DataFrame({'zip,company': ["46062|A","11236|B","11236|C","11236|C","11236|C","11236|A","11236|A","11236|A","11236|B","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A"], 
                   'goodbadscore': ["good","bad","bad","good","good","bad","bad","good","good","good","bad","good","good","good","good","bad","bad","good"],
                   'postlcode' : ["46062","11236","11236","11236","11236","46062","11236","46062","11236","11236","11236","11236","11236","11236","11236","11236","11236","11236"],
                   'companyname': ["A","B","C","C","C","A","A","A","B","B","A","A","B","A","A","B","A","A"]}
                   )
                   
print(df)

-----updated a sample data frame above as suggestion-----

I tried to produce the result in Excel, but using countif and countifs break my desktop and even it's fine, it takes several minutes to complete the task. hope can get some help and directions.

here is what i try to achieve:

I want to score company's' reputation in several zip codes based on the collected data. columns needed to produce:

countinzipcode
countgoodscoreinzip
dividegoodscore%(2/1)
ranking

I was able to produce 1 :
op = df.groupby(['zip+company'])['zip+company'].count()
have difficulty on 2 : want to keep the output from 1, but it becomes 0 after apply. only want to show good for column 2
op = op.groupby(['zip+company'])[['zip+company','countgoodscoreunderzip']].apply(lambda x: x[x=='good'].count())
then 3 , I guess it's a matter of selecting 2 and divided by 1
4 no idea yet how to rank in pandas, which could be a simple ranking

The pic of excel is the ideal output( updated with a sample data frame ).

Thanks for the reading.

Answer 1

Named aggregation should help the first two columns:

op = df.groupby('zip,company', as_index=False).aggregate(
    countinzipcode=('zip,company', 'count'),
    goodscoreinzip=('goodbadscore', lambda s: s.eq('good').sum())
)

op :

  zip,company  countinzipcode  goodscoreinzip
0     11236|A               7               4
1     11236|B               5               3
2     11236|C               3               2
3     46062|A               3               2

Simple math operations can be used to get the percentage for 3:

op['goodscore%'] = op['goodscoreinzip'] / op['countinzipcode'] * 100

  zip,company  countinzipcode  goodscoreinzip  goodscore%
0     11236|A               7               4   57.142857
1     11236|B               5               3   60.000000
2     11236|C               3               2   66.666667
3     46062|A               3               2   66.666667

Then rank can be used to get the ranking for 4:

op['ranking'] = op['goodscore%'].rank(ascending=False, method='dense').astype(int)

op :

  zip,company  countinzipcode  goodscoreinzip  goodscore%  ranking
0     11236|A               7               4   57.142857        3
1     11236|B               5               3   60.000000        2
2     11236|C               3               2   66.666667        1
3     46062|A               3               2   66.666667        1

Sample Data Used (Based on the numbers in the image not the code constructor):

df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
    'postlcode': ["46062", "11236", "11236", "11236", "11236",
                  "46062", "11236", "46062", "11236", "11236",
                  "11236", "11236", "11236", "11236", "11236",
                  "11236", "11236", "11236"],
    'companyname': ["A", "B", "C", "C", "C", "A", "A", "A", "B",
                    "B", "A", "A", "B", "A", "A", "B", "A", "A"]
})

   zip,company goodbadscore postlcode companyname
0      46062|A         good     46062           A
1      11236|B          bad     11236           B
2      11236|C          bad     11236           C
3      11236|C         good     11236           C
4      11236|C         good     11236           C
5      11236|A          bad     46062           A
6      11236|A          bad     11236           A
7      11236|A         good     46062           A
8      11236|B         good     11236           B
9      11236|B         good     11236           B
10     11236|A          bad     11236           A
11     11236|A         good     11236           A
12     11236|B         good     11236           B
13     11236|A         good     11236           A
14     11236|A         good     11236           A
15     11236|B          bad     11236           B
16     46062|A          bad     11236           A
17     46062|A         good     11236           A

Create multiple columns with calculations using groupby, aggregate functions in Pandas

Question

1 answers

solution1
1 ACCPTED 2021-07-09 19:18:34

Create multiple columns with calculations using groupby, aggregate functions in Pandas

Question

1 answers

solution1 1 ACCPTED 2021-07-09 19:18:34

solution1
1 ACCPTED 2021-07-09 19:18:34