I know how to get the count of a specific keyword in a pandas DataFrame, but is there an efficient way to get the counts for every keyword in a given set all at once, instead of one by one?
This is not a great question because there's so little detail, but I'll assume you have a Series of strings, each of which contains some "words" separated by "delimiters", and a master list of keywords whose occurrences you want to count in each row. In that case, the following counts how many times any of the keywords appears in each row:
>>> import pandas as pd, re
>>> s = pd.Series(['a,b', 'b,c', 'c'])
>>> s
0    a,b
1    b,c
2      c
dtype: object
>>> keywords = ['a', 'b']
>>> pattern = re.compile('|'.join(map(re.escape, keywords))) # Form regex matching any keyword
>>> s.str.count(pattern)
0    2
1    1
2    0
dtype: int64
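If you want a separate count for each keyword rather than one combined number per row, one option is to call str.count once per keyword and collect the results as columns. This is only a sketch: the \b word boundaries are my addition, so that a keyword like 'a' is not counted inside a longer word, and they assume every keyword starts and ends with a letter, digit or underscore. The name per_keyword is just illustrative.

```python
import re
import pandas as pd

s = pd.Series(['a,b', 'b,c', 'c'])
keywords = ['a', 'b']

# One column per keyword. \b stops 'a' from matching inside a longer word such as 'cab'.
per_keyword = pd.DataFrame(
    {k: s.str.count(r'\b' + re.escape(k) + r'\b') for k in keywords}
)
print(per_keyword)

# Column sums give the total for each keyword over the whole Series.
print(per_keyword.sum())
```

This makes one pass over the Series per keyword, so it suits a short keyword list better than a very long one.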
If you need the total count of each keyword over the whole column, rather than per row as in the other answer:
One possible solution is to join the values of the column with a space, split the result, count the words with Counter, and finally filter the counts in a dict comprehension:
from collections import Counter
L = ['aaa','bbb','ccc']
c = Counter((' '.join(df['words'])).split())
out = {k: v for k, v in c.items() if k in L}
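A modification: split first, then filter, and only then count. This is better if the real data has many unique words, because words outside the list never reach the Counter:
keep = set(L)  # build the set once, not on every iteration
out = Counter(x for x in ' '.join(df['words']).split() if x in keep)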
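For anyone who wants to try the split, filter, count idea without first joining the whole column into one big string, here is a small self-contained sketch. The sample DataFrame is the one used in the timings of this answer; the names keywords and totals are just illustrative, and I have not benchmarked this variant against the versions above.

```python
from collections import Counter
import pandas as pd

# Sample data; the column name 'words' follows the snippets above.
df = pd.DataFrame({'words': ['aaa vv bbb bbb ddd', 'bbb aaa', 'ccc ccc', 'bbb ccc']})
keywords = {'aaa', 'bbb', 'ccc'}  # a set gives constant-time membership tests

# Split each row on whitespace, keep only the keywords, and count them in one pass.
totals = Counter(
    word
    for row in df['words']
    for word in row.split()
    if word in keywords
)
print(totals)
```

Note that row.split() will fail on missing values, so a column containing NaN would need a dropna() first.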
Another solution, in pure pandas, is to reshape first, then filter, and finally count:
s = df['words'].str.split(expand=True).stack()
out = s[s.isin(L)].value_counts()
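A possible variation on the reshape step, assuming pandas 0.25 or newer: Series.explode can stand in for split(expand=True).stack(), which avoids building a wide intermediate DataFrame padded with missing values. This is a sketch on the same sample data, not something I have timed.

```python
import pandas as pd

df = pd.DataFrame({'words': ['aaa vv bbb bbb ddd', 'bbb aaa', 'ccc ccc', 'bbb ccc']})
L = ['aaa', 'bbb', 'ccc']

# str.split() yields one list per row; explode() gives every list item its own row.
tokens = df['words'].str.split().explode()

# Keep only the keywords, then count how often each one occurs.
out = tokens[tokens.isin(L)].value_counts()
print(out)
```

The result is a Series indexed by keyword; out.to_dict() turns it into the same kind of dict as the Counter solution.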
Timings:
These depend on the number of words in the list L, the length of the DataFrame and the number of unique words, so results on real data will differ:
import pandas as pd
from collections import Counter
# Build a 40,000-row test frame by repeating four sample rows
df = pd.DataFrame({'words':['aaa vv bbb bbb ddd','bbb aaa','ccc ccc','bbb ccc']})
df = pd.concat([df] * 10000, ignore_index=True)
L = ['aaa','bbb','ccc']
# Counter solution
c = Counter(' '.join(df['words']).split())
out = {k: v for k, v in c.items() if k in L}
print(out)
# pandas reshape solution
s = df['words'].str.split(expand=True).stack()
out = s[s.isin(L)].value_counts()
print(out)
In [6]: %%timeit
...: c = Counter((' '.join(df['words'])).split())
...: out = {k: v for k, v in c.items() if k in L}
...:
24.8 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %%timeit
...: s = df['words'].str.split(expand=True).stack()
...: out = s[s.isin(L)].value_counts()
...:
145 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
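In the run above the Counter version took about 25 ms against about 145 ms for the reshape version. To repeat the comparison on your own data outside IPython, something like the following should work. It is only a sketch using the standard timeit module; the function names are made up for illustration, and the test frame is smaller than the one above so that it finishes quickly.

```python
import timeit
from collections import Counter
import pandas as pd

df = pd.DataFrame({'words': ['aaa vv bbb bbb ddd', 'bbb aaa', 'ccc ccc', 'bbb ccc']})
df = pd.concat([df] * 1000, ignore_index=True)  # 4,000 rows
L = ['aaa', 'bbb', 'ccc']

def with_counter():
    # Join, split, count everything, then filter the counts.
    c = Counter(' '.join(df['words']).split())
    return {k: v for k, v in c.items() if k in L}

def with_pandas():
    # Reshape to one word per row, filter, then count.
    s = df['words'].str.split(expand=True).stack()
    return s[s.isin(L)].value_counts().to_dict()

# Both approaches should agree before their speed is compared.
assert with_counter() == with_pandas()

for fn in (with_counter, with_pandas):
    seconds = timeit.timeit(fn, number=5) / 5
    print(f'{fn.__name__}: {seconds * 1000:.2f} ms per call')
```

The absolute numbers will vary by machine and pandas version, so only the ratio between the two is worth reading.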