import pandas as pd
df = pd.DataFrame({'zip,company': ["46062|A","11236|B","11236|C","11236|C","11236|C","11236|A","11236|A","11236|A","11236|B","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A","11236|B","11236|A","11236|A"],
'goodbadscore': ["good","bad","bad","good","good","bad","bad","good","good","good","bad","good","good","good","good","bad","bad","good"],
'postlcode' : ["46062","11236","11236","11236","11236","46062","11236","46062","11236","11236","11236","11236","11236","11236","11236","11236","11236","11236"],
'companyname': ["A","B","C","C","C","A","A","A","B","B","A","A","B","A","A","B","A","A"]}
)
print(df)
(Updated: added a sample data frame above, as suggested.)
I tried to produce the result in Excel, but COUNTIF and COUNTIFS freeze my desktop, and even when they work, the task takes several minutes to complete. I hope to get some help and direction.
Here is what I am trying to achieve:
I want to score each company's reputation in several zip codes based on the collected data. The columns I need to produce are: 1, the row count per zip+company; 2, the count of good scores; 3, the percentage of good scores; 4, a ranking.
I was able to produce 1:
op = df.groupby(['zip,company'])['zip,company'].count()
I am having difficulty with 2: I want to keep the output from 1, but it becomes 0 after apply. Column 2 should count only the "good" scores.
op = op.groupby(['zip+company'])[['zip+company','countgoodscoreunderzip']].apply(lambda x: x[x=='good'].count())
Then 3: I guess it's a matter of dividing 2 by 1.
4: I have no idea yet how to rank in pandas; it could be a simple ranking.
The Excel screenshot shows the ideal output (updated with a sample data frame).
Thanks for reading.
Named aggregation should help with the first two columns:
op = df.groupby('zip,company', as_index=False).aggregate(
    countinzipcode=('zip,company', 'count'),
    goodscoreinzip=('goodbadscore', lambda s: s.eq('good').sum())
)
op:
  zip,company  countinzipcode  goodscoreinzip
0     11236|A               7               4
1     11236|B               5               3
2     11236|C               3               2
3     46062|A               3               2
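As a variation on the lambda above, a boolean helper column lets both aggregations use built-in reducers. This is only a sketch: the names is_good, flagged and op_alt are made up for illustration and are not part of the original answer. It uses the same sample data as the answer, trimmed to the two columns that matter here.

```python
import pandas as pd

# Same sample data as the answer, restricted to the columns used below.
df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
})

# Hypothetical helper column: True where the score is "good".
flagged = df.assign(is_good=df['goodbadscore'].eq('good'))

op_alt = flagged.groupby('zip,company', as_index=False).agg(
    countinzipcode=('is_good', 'count'),  # non-null rows per group
    goodscoreinzip=('is_good', 'sum'),    # each True counts as 1
)
print(op_alt)
```

Summing a boolean column counts its True values, so this should give the same two columns as the lambda version while staying on the built-in aggregation paths.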
Simple math operations can be used to get the percentage for 3:
op['goodscore%'] = op['goodscoreinzip'] / op['countinzipcode'] * 100
  zip,company  countinzipcode  goodscoreinzip  goodscore%
0     11236|A               7               4   57.142857
1     11236|B               5               3   60.000000
2     11236|C               3               2   66.666667
3     46062|A               3               2   66.666667
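The percentage can also be computed directly, because the mean of a boolean Series is the fraction of True values. The following is a minimal sketch under that assumption; the variable names is_good and pct are illustrative only.

```python
import pandas as pd

# Same sample data as the answer, restricted to the columns used below.
df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
})

# True where the score is "good"; the group mean is the share of good rows.
is_good = df['goodbadscore'].eq('good')
pct = is_good.groupby(df['zip,company']).mean().mul(100)
print(pct)
```

This skips the intermediate count columns, so it fits best when only the percentage is needed.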
Then rank can be used to get the ranking for 4:
op['ranking'] = op['goodscore%'].rank(ascending=False, method='dense').astype(int)
op:
  zip,company  countinzipcode  goodscoreinzip  goodscore%  ranking
0     11236|A               7               4   57.142857        3
1     11236|B               5               3   60.000000        2
2     11236|C               3               2   66.666667        1
3     46062|A               3               2   66.666667        1
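Because two groups tie at 66.67%, the choice of method matters. The sketch below compares three of the tie-handling options of Series.rank on the percentages shown above; the Series is rebuilt by hand here, and the names pct and ranks are illustrative only.

```python
import pandas as pd

# Percentages from the table above, rebuilt by hand for a standalone example.
pct = pd.Series(
    [400 / 7, 60.0, 200 / 3, 200 / 3],
    index=['11236|A', '11236|B', '11236|C', '46062|A'],
    name='goodscore%',
)

ranks = pd.DataFrame({
    # Ties share a rank and the next rank follows without a gap.
    'dense': pct.rank(ascending=False, method='dense'),
    # Ties share the lowest rank and the following ranks are skipped.
    'min': pct.rank(ascending=False, method='min'),
    # Default: ties get the average of the ranks they span.
    'average': pct.rank(ascending=False),
})
print(ranks)
```

With method='dense' the ranks run 1, 1, 2, 3 with no gaps, which is probably closest to a simple ranking; method='min' would give 1, 1, 3, 4 instead.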
Sample data used (based on the numbers in the image, not the constructor in the question):
df = pd.DataFrame({
    'zip,company': ["46062|A", "11236|B", "11236|C", "11236|C",
                    "11236|C", "11236|A", "11236|A", "11236|A",
                    "11236|B", "11236|B", "11236|A", "11236|A",
                    "11236|B", "11236|A", "11236|A", "11236|B",
                    "46062|A", "46062|A"],
    'goodbadscore': ["good", "bad", "bad", "good", "good", "bad",
                     "bad", "good", "good", "good", "bad",
                     "good", "good", "good", "good", "bad",
                     "bad", "good"],
    'postlcode': ["46062", "11236", "11236", "11236", "11236",
                  "46062", "11236", "46062", "11236", "11236",
                  "11236", "11236", "11236", "11236", "11236",
                  "11236", "11236", "11236"],
    'companyname': ["A", "B", "C", "C", "C", "A", "A", "A", "B",
                    "B", "A", "A", "B", "A", "A", "B", "A", "A"]
})
   zip,company goodbadscore postlcode companyname
0      46062|A         good     46062           A
1      11236|B          bad     11236           B
2      11236|C          bad     11236           C
3      11236|C         good     11236           C
4      11236|C         good     11236           C
5      11236|A          bad     46062           A
6      11236|A          bad     11236           A
7      11236|A         good     46062           A
8      11236|B         good     11236           B
9      11236|B         good     11236           B
10     11236|A          bad     11236           A
11     11236|A         good     11236           A
12     11236|B         good     11236           B
13     11236|A         good     11236           A
14     11236|A         good     11236           A
15     11236|B          bad     11236           B
16     46062|A          bad     11236           A
17     46062|A         good     11236           A