I am working with pandas and have a dataframe that contains a list of sentences and people who said them, like this:
sentence                 person
'hello world'            Matt
'cake, delicious cake!'  Matt
'lovely day'             Maria
'i like cake'            Matt
'a new day'              Maria
'a new world'            Maria
I want to count non-overlapping matches of regex strings (e.g. cake, world, day) in sentence, grouped by person. Note each row of sentence may contain more than one match (e.g. cake appears twice in 'cake, delicious cake!'):
person 'day' 'cake' 'world'
Matt 0 3 1
Maria 2 0 1
So far I am doing this:
rows_cake = df[df['sentence'].str.contains(r"cake")]
counts_cake = rows_cake['person'].value_counts()
However this str.contains gives me the rows containing cake, but not the individual instances of cake.
I know I can use str.count(r"cake") on rows_cake. However, in practice my dataframe is extremely large (> 10 million rows) and the regexes I am using are quite complex, so I am looking for a more efficient solution if possible.
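For reference, the straightforward str.count version looks like this (a sketch on the toy data above); it gives the right answer, but it makes a full regex pass over the column per pattern:

import pandas as pd

# example data from above
df = pd.DataFrame({
    'sentence': ["'hello world'", "'cake, delicious cake!'", "'lovely day'",
                 "'i like cake'", "'a new day'", "'a new world'"],
    'person': ['Matt', 'Matt', 'Maria', 'Matt', 'Maria', 'Maria'],
})

# str.count counts non-overlapping regex matches per row
patterns = ['day', 'cake', 'world']
for p in patterns:
    df[p] = df['sentence'].str.count(p)

print(df.groupby('person')[patterns].sum())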
Maybe you should first pull out the sentences themselves and then use re to do your optimized regex work, like this:
for row in df.itertuples(index=False):
    # row[0] is the sentence, row[1] is the person
    do_some_regex_stuff(row[0], row[1])
As far as I know itertuples is quite fast (see the Notes section of the pandas itertuples documentation). So the only optimization problem you have is with the regex itself.
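For instance, if do_some_regex_stuff just counts matches, it could look something like this (a sketch; I am assuming precompiled patterns and the column order sentence, person from your example):

import re
from collections import defaultdict

# compile each pattern once, outside the loop
patterns = {name: re.compile(name) for name in ('day', 'cake', 'world')}

counts = defaultdict(lambda: defaultdict(int))
for sentence, person in df.itertuples(index=False):
    for name, pat in patterns.items():
        # findall returns non-overlapping matches
        counts[person][name] += len(pat.findall(sentence))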
I came up with a rather simple solution, but I can't claim it to be the fastest or most efficient.
import pandas as pd
import numpy as np
# to be used with read_clipboard()
'''
sentence person
'hello world' Matt
'cake, delicious cake!' Matt
'lovely day' Maria
'i like cake' Matt
'a new day' Maria
'a new world' Maria
'''
df = pd.read_clipboard()
# print(df)
Output:
sentence person
0 'hello world' Matt
1 'cake, delicious cake!' Matt
2 'lovely day' Maria
3 'i like cake' Matt
4 'a new day' Maria
5 'a new world' Maria
# if the list of keywords is fixed and relatively small
keywords = ['day', 'cake', 'world']

# for each keyword and each sentence, count the occurrences:
# splitting on the keyword yields one more piece than there are matches
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]
# print(df)
Output:
sentence person day cake world
0 'hello world' Matt 0 0 1
1 'cake, delicious cake!' Matt 0 2 0
2 'lovely day' Maria 1 0 0
3 'i like cake' Matt 0 1 0
4 'a new day' Maria 1 0 0
5 'a new world' Maria 0 0 1
# create a simple pivot with the data you need
df_pivot = pd.pivot_table(df,
                          values=['day', 'cake', 'world'],
                          columns=['person'],
                          aggfunc=np.sum).T
# print(df_pivot)
Final Output:
cake day world
person
Maria 0 2 1
Matt 3 0 1
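As an aside, since this is really just a per-person sum, the same table can also be produced without a pivot (using the keywords list from above):

df_pivot = df.groupby('person')[keywords].sum()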
Open to suggestions if this seems to be a good approach, especially given the volume of data. Eager to learn.
Since this primarily involves strings, I would suggest taking the computation out of Pandas: plain Python is faster than Pandas in most cases when it comes to string manipulation.
# read in data
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

# create a dictionary mapping each person to a joined string of their sentences
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k, v in zip(df.person, df.sentence):
    d[k].append(v)
d = {k: ",".join(v) for k, v in d.items()}
# search words
strings = ("cake", "world", "day")

# get the count of each word per person and collect into a dict
m = defaultdict(list)
for k, v in d.items():
    for st in strings:
        m[k].append({st: v.count(st)})
res = {k: dict(ChainMap(*v)) for k, v in m.items()}
print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
'Maria': {'day': 2, 'world': 1, 'cake': 0}}
output = pd.DataFrame(res).T
day world cake
Matt 0 1 3
Maria 2 1 0
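One caveat: v.count(st) counts plain substrings, not regex matches (so day would also match inside daydream). If the real patterns are proper regexes, the counting step can be swapped for precompiled patterns while keeping the rest of the approach; a sketch, assuming the d and strings from above:

import re

compiled = {st: re.compile(st) for st in strings}
# findall returns non-overlapping matches, as the question requires
res = {person: {st: len(compiled[st].findall(text)) for st in strings}
       for person, text in d.items()}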
Test the speeds and see which one is better; it would be useful for me and others as well.
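A minimal timing harness for that comparison (the count_* names are placeholders, assuming each approach above is wrapped in a function taking the dataframe):

import timeit

for name, fn in [('itertuples', lambda: count_itertuples(df)),
                 ('split', lambda: count_split(df)),
                 ('python-dict', lambda: count_dict(df))]:
    print(name, timeit.timeit(fn, number=10))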