
What is the most efficient way of counting occurrences of a bunch of specific keywords in pandas?

I know how to get the count of a single specific keyword in a pandas dataframe, but I am wondering whether there is an efficient way to get counts for each of a set of specific keywords all at once, instead of doing it one by one?

This is not a great question because there's so little detail, but I'll assume you have a series of strings, each of which contains some "words" separated by "delimiters", and you have a master list of keywords whose occurrences you want counted in each row. In that case,

>>> import pandas as pd, re
>>> s = pd.Series(['a,b', 'b,c', 'c'])   
>>> s
0    a,b
1    b,c
2      c
dtype: object
>>> keywords = ['a', 'b'] 
>>> pattern = re.compile('|'.join(map(re.escape, keywords)))  # Form regex matching any keyword
>>> s.str.count(pattern)
0    2
1    1
2    0
dtype: int64

If you need the total count of each keyword across the whole column, rather than per row as in the other answer:

One possible solution is to join the values of the column with spaces and split; use Counter for the counting, and finally filter with a dict comprehension:

from collections import Counter

L = ['aaa','bbb','ccc']

c = Counter((' '.join(df['words'])).split())

out = {k: v for k, v in c.items() if k in L}
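Put together with some sample data (the same hypothetical dataframe used in the timings below), the approach looks like this:

```python
from collections import Counter
import pandas as pd

# Sample data for illustration
df = pd.DataFrame({'words': ['aaa vv bbb bbb ddd', 'bbb aaa', 'ccc ccc', 'bbb ccc']})
L = ['aaa', 'bbb', 'ccc']

# Count every word once, then keep only the keywords of interest
c = Counter(' '.join(df['words']).split())
out = {k: v for k, v in c.items() if k in L}
```

Here `out` is `{'aaa': 2, 'bbb': 4, 'ccc': 3}`; the non-keyword words `vv` and `ddd` are dropped by the filter.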

A modification: first split, then filter, and only then count — better if the real data contains many unique words. (Note: build the set once outside the generator; writing `if x in set(L)` inline would rebuild the set for every word.)

Lset = set(L)
out = Counter(x for x in (' '.join(df['words'])).split() if x in Lset)

Another pandas solution is to reshape first, then filter, and finally count:

s = df['words'].str.split(expand=True).stack()
out = s[s.isin(L)].value_counts()
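On newer pandas (0.25+), `Series.explode` offers an alternative to the `expand=True` + `stack` reshape; a sketch of the same filter-and-count, using the sample data from the timings below:

```python
import pandas as pd

df = pd.DataFrame({'words': ['aaa vv bbb bbb ddd', 'bbb aaa', 'ccc ccc', 'bbb ccc']})
L = ['aaa', 'bbb', 'ccc']

# Split each row into a list of words, explode to one word per row,
# then keep only the keywords and count them
s = df['words'].str.split().explode()
out = s[s.isin(L)].value_counts()
```

`explode` avoids building the wide intermediate DataFrame that `expand=True` creates, which can matter when rows have very different word counts.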

Timings:

These depend on the number of words in the list L, the length of the DataFrame, and the number of unique words, so results on real data may differ:

df = pd.DataFrame({'words':['aaa vv bbb bbb ddd','bbb aaa','ccc ccc','bbb ccc']})
df = pd.concat([df] * 10000, ignore_index=True)

from collections import Counter

L = ['aaa','bbb','ccc']

c = Counter((' '.join(df['words'])).split())
out = {k: v for k, v in c.items() if k in L}
print(out)

s = df['words'].str.split(expand=True).stack()
out = s[s.isin(L)].value_counts()
print(out)

In [6]: %%timeit
   ...: c = Counter((' '.join(df['words'])).split())
   ...: out = {k: v for k, v in c.items() if k in L}
   ...: 
24.8 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %%timeit 
   ...: s = df['words'].str.split(expand=True).stack()
   ...: out = s[s.isin(L)].value_counts()
   ...: 
145 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
