I was trying to find the best variable to split a decision tree on, which required grouping and counting the occurrences of some values. A dummy data set is:
zipped=[('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'), ('b', 'Premium'), ('e', 'None'), ('e', 'Basic'), ('b', 'Premium'), ('a', 'None'), ('c', 'None'), ('b', 'None'), ('d', 'None'), ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'), ('e', 'Basic')]
I would like to find how many None, Basic and Premium values there are for each of a, b, c, d, e. I need the result to look like
{'a':{'None':3,'Basic':0,'Premium':0}, 'b':{'None':1,'Basic':1,'Premium':3},…}
I am also open to a better way of aggregating or a better data structure. Here is what I tried:
temp=Counter( x[1] for x in zipped if x[0]=='b')
print(temp)
and I got
Counter({'Premium': 3, 'None': 1, 'Basic': 1})
Assuming your a, b etc. are your slashdot, google:
zipped=[('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'), ('b', 'Premium'),
('e', 'None'), ('e', 'Basic'), ('b', 'Premium'), ('a', 'None'), ('c', 'None'),
('b', 'None'), ('d', 'None'), ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'),
('e', 'Basic')]
from collections import Counter
d = {}
for key, val in zipped:
    d.setdefault(key, []).append(val)  # create key with empty list (if needed) + append val
# the values are now ordered lists; overwrite each with a Counter of it:
for key in d:
    d[key] = Counter(d[key])
print(d)
Output:
{'a': Counter({'None': 3}),
'b': Counter({'Premium': 3, 'None': 1, 'Basic': 1}),
'c': Counter({'Basic': 2, 'None': 1}),
'd': Counter({'Basic': 1, 'None': 1}),
'e': Counter({'Basic': 2, 'None': 1})}
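The two passes above (build lists, then convert to Counters) can also be folded into one. This is only a minimal sketch, assuming the same zipped list; the name counts is chosen here for illustration and is not from the original answer:

```python
from collections import Counter, defaultdict

zipped = [('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'),
          ('b', 'Premium'), ('e', 'None'), ('e', 'Basic'), ('b', 'Premium'),
          ('a', 'None'), ('c', 'None'), ('b', 'None'), ('d', 'None'),
          ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'), ('e', 'Basic')]

# Each unseen key starts with an empty Counter, so no intermediate lists are needed.
counts = defaultdict(Counter)  # illustrative name
for key, val in zipped:
    counts[key][val] += 1

print(dict(counts))
```

The result holds the same per-key Counters as d above; whether one pass or two reads better is mostly a matter of taste.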
Counter gives you .most_common() to get the lists you want:
for k in d:
    print(k, d[k].most_common())
Output:
a [('None', 3)]
b [('Premium', 3), ('None', 1), ('Basic', 1)]
c [('Basic', 2), ('None', 1)]
d [('Basic', 1), ('None', 1)]
e [('Basic', 2), ('None', 1)]
If you really need 0-counts, you can add them after the fact:
allVals = {v for _, v in zipped}  # get distinct values of zipped
for key in d:
    for v in allVals:
        d[key].update([v])    # add value once
        d[key].subtract([v])  # subtract value once
A bit cumbersome, but that way every value will be present for all of them, with a count of 0 if it does not occur in zipped for that key:
for k in d:
    print(k, d[k].most_common())
Output:
a [('None', 3), ('Premium', 0), ('Basic', 0)]
b [('Premium', 3), ('None', 1), ('Basic', 1)]
c [('Basic', 2), ('None', 1), ('Premium', 0)]
d [('Basic', 1), ('None', 1), ('Premium', 0)]
e [('Basic', 2), ('None', 1), ('Premium', 0)]
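If plain dicts with explicit zeros are acceptable, the update/subtract step may not be needed: indexing a Counter with a missing key returns 0 instead of raising KeyError. A hedged sketch of that idea follows; per_key, all_vals and filled are illustrative names, not part of the answer above:

```python
from collections import Counter

zipped = [('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'),
          ('b', 'Premium'), ('e', 'None'), ('e', 'Basic'), ('b', 'Premium'),
          ('a', 'None'), ('c', 'None'), ('b', 'None'), ('d', 'None'),
          ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'), ('e', 'Basic')]

# Distinct values in first-seen order (dict keys keep insertion order in Python 3.7+).
all_vals = list(dict.fromkeys(v for _, v in zipped))

per_key = {}
for key, val in zipped:
    per_key.setdefault(key, Counter())[val] += 1

# Looking up a value the Counter has never seen yields 0, which fills the gaps.
filled = {key: {v: c[v] for v in all_vals} for key, c in per_key.items()}

print(filled['a'])  # {'None': 3, 'Premium': 0, 'Basic': 0}
```

Unlike the set used above, the first-seen ordering makes the column order the same on every run.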
You can try something like this:
data=[('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'), ('b', 'Premium'),
('e', 'None'), ('e', 'Basic'), ('b', 'Premium'), ('a', 'None'), ('c', 'None'),
('b', 'None'), ('d', 'None'), ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'),
('e', 'Basic')]
manual_dict={}
# group the values by key
for key, val in data:
    if key not in manual_dict:
        manual_dict[key] = [val]
    else:
        manual_dict[key].append(val)
final_dict = {}
# count each category per key; list.count returns 0 when the value is absent
for key, vals in manual_dict.items():
    final_dict[key] = {'None': vals.count('None'), 'Basic': vals.count('Basic'), 'Premium': vals.count('Premium')}
print(final_dict)
Output:
{'c': {'Premium': 0, 'None': 1, 'Basic': 2}, 'a': {'Premium': 0, 'None': 3, 'Basic': 0}, 'd': {'Premium': 0, 'None': 1, 'Basic': 1}, 'b': {'Premium': 3, 'None': 1, 'Basic': 1}, 'e': {'Premium': 0, 'None': 1, 'Basic': 2}}
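Since the question also asks about other data structures, one more option is to count the (key, value) tuples directly and reshape afterwards. This is a sketch under the assumption that the three category names are known up front, as in the code above; pair_counts and table are hypothetical names:

```python
from collections import Counter

data = [('a', 'None'), ('b', 'Premium'), ('c', 'Basic'), ('d', 'Basic'),
        ('b', 'Premium'), ('e', 'None'), ('e', 'Basic'), ('b', 'Premium'),
        ('a', 'None'), ('c', 'None'), ('b', 'None'), ('d', 'None'),
        ('c', 'Basic'), ('a', 'None'), ('b', 'Basic'), ('e', 'Basic')]

# Tuples are hashable, so Counter can count each (key, value) pair as a unit.
pair_counts = Counter(data)

keys = sorted({k for k, _ in data})
values = ['None', 'Basic', 'Premium']  # assumed known, hard-coded as above

# Pairs that never occur are read back as 0 from the Counter.
table = {k: {v: pair_counts[(k, v)] for v in values} for k in keys}

print(table['b'])  # {'None': 1, 'Basic': 1, 'Premium': 3}
```

The flat pair_counts is handy when only individual cells are needed, for example pair_counts[('b', 'Premium')], while table gives the nested layout asked for in the question.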