I have a dataframe in which each row represents one transaction and the items within that transaction. Here is what my dataframe looks like:
itemList
A,B,C
B,F
G,A
...
I want to find the frequency of each item (how many times it appears across the transactions). I have defined a dictionary and am trying to update its values as shown below:
item_counts = {}
def update(itemList):
    # Update the count of each item in the dictionary
    for item in itemList.split(','):
        item_counts[item] = item_counts.get(item, 0) + 1
df.itemList.apply(update)
Since the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in the dictionary at the same time, and this is causing an issue. How can I make sure that multiple updates to the dictionary do not cause any issues?
I think you only need Series.str.get_dummies:
df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}
If there are more columns, use:
df.stack().str.get_dummies(',').sum().to_dict()
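As a rough illustration of the multi-column case, here is a small sketch. The second column, otherList, and its values are made up for this example and are not from the question. Note that get_dummies records whether an item is present in a cell, so an item repeated inside one cell would be counted once for that cell.

```python
import pandas as pd

# 'itemList' is the question's data; 'otherList' is a hypothetical second column
df = pd.DataFrame({
    'itemList': ['A,B,C', 'B,F', 'G,A'],
    'otherList': ['A', 'C,F', 'B'],
})

# stack() puts every cell into one long Series, get_dummies() builds one
# indicator column per item, and sum() adds the indicators up per item
counts = df.stack().str.get_dummies(',').sum().to_dict()
print(counts)
```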
If you want the counts for each row:
df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}
As @Quang Hoang said in the comments, apply simply applies the function to each row/column using a loop, so the updates do not happen at the same time.
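One way to convince yourself of this is to log the order in which rows are visited. The sketch below is only an illustration: the calls list and the record helper are hypothetical names, and the data is the sample from the question.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({'itemList': ['A,B,C', 'B,F', 'G,A']})

calls = []  # hypothetical log of the rows in the order they are processed

def record(item_list):
    # Called once per row by Series.apply, one row after another
    calls.append(item_list)

df['itemList'].apply(record)
print(calls)
```

The rows should show up in index order, one call per row, which is what you would expect from an ordinary loop rather than from parallel execution.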
You might be better off relying on native Python here. Sample data:
df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})
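Here is a solution using Counter (from collections import Counter). Note that removing the commas and counting characters only works because each item here is a single character: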
df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
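If the items can be longer than one character, a variant that splits on the comma instead of counting characters might look like the sketch below. It uses the sample data from the question and is meant as an illustration, not a benchmarked replacement for the line above.

```python
from collections import Counter

import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({'itemList': ['A,B,C', 'B,F', 'G,A']})

# Split each row on ',' and feed every item into a single Counter
counts = Counter(item for row in df['itemList'] for item in row.split(','))
print(dict(counts))
```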
Some timing comparisons:
%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)