I am working with pandas and have a dataframe that contains a list of sentences and people who said them, like this:
sentence                 person
'hello world'            Matt
'cake, delicious cake!'  Matt
'lovely day'             Maria
'i like cake'            Matt
'a new day'              Maria
'a new world'            Maria
I want to count non-overlapping matches of regex strings (e.g. cake, world, day) in sentence, grouped by person. Note each row of sentence may contain more than one match (e.g. cake appears twice in 'cake, delicious cake!'):
person 'day' 'cake' 'world'
Matt 0 3 1
Maria 2 0 1
So far I am doing this:
rows_cake = df[df['sentence'].str.contains(r"cake")]
counts_cake = rows_cake['person'].value_counts()
However this str.contains gives me the rows containing cake, but not the individual instances of cake.
I know I can use str.count(r"cake") on rows_cake. However, in practice my dataframe is extremely large (> 10 million rows) and the regexes I am using are quite complex, so I am looking for a more efficient solution if possible.
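For reference, the straightforward str.count version looks like this (a sketch on the toy data above); it gives the right answer, but it makes a full regex pass over the column per pattern:

import pandas as pd

# example data from above
df = pd.DataFrame({
    'sentence': ["'hello world'", "'cake, delicious cake!'", "'lovely day'",
                 "'i like cake'", "'a new day'", "'a new world'"],
    'person': ['Matt', 'Matt', 'Maria', 'Matt', 'Maria', 'Maria'],
})

# str.count counts non-overlapping regex matches per row
patterns = ['day', 'cake', 'world']
for p in patterns:
    df[p] = df['sentence'].str.count(p)

print(df.groupby('person')[patterns].sum())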
Maybe you should first pull out the sentences themselves and then use re to do your optimized regex work, like this:
for row in df.itertuples(index=False):
    # row[0] is the sentence, row[1] is the person
    do_some_regex_stuff(row[0], row[1])
As far as I know itertuples is quite fast (see the Notes section of the pandas itertuples documentation). So the only optimization problem you have is with the regex itself.
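For instance, if do_some_regex_stuff just counts matches, it could look something like this (a sketch; I am assuming precompiled patterns and the column order sentence, person from your example):

import re
from collections import defaultdict

# compile each pattern once, outside the loop
patterns = {name: re.compile(name) for name in ('day', 'cake', 'world')}

counts = defaultdict(lambda: defaultdict(int))
for sentence, person in df.itertuples(index=False):
    for name, pat in patterns.items():
        # findall returns non-overlapping matches
        counts[person][name] += len(pat.findall(sentence))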
I came up with a rather simple solution, but I can't claim it to be the fastest or most efficient.
import pandas as pd
import numpy as np
# to be used with read_clipboard()
'''
sentence person
'hello world' Matt
'cake, delicious cake!' Matt
'lovely day' Maria
'i like cake' Matt
'a new day' Maria
'a new world' Maria
'''
df = pd.read_clipboard()
# print(df)
Output:
sentence person
0 'hello world' Matt
1 'cake, delicious cake!' Matt
2 'lovely day' Maria
3 'i like cake' Matt
4 'a new day' Maria
5 'a new world' Maria
# if the list of keywords is fixed and relatively small
keywords = ['day', 'cake', 'world']

# for each keyword and each sentence, count the occurrences:
# splitting on the keyword yields one more piece than there are matches
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]
# print(df)
Output:
sentence person day cake world
0 'hello world' Matt 0 0 1
1 'cake, delicious cake!' Matt 0 2 0
2 'lovely day' Maria 1 0 0
3 'i like cake' Matt 0 1 0
4 'a new day' Maria 1 0 0
5 'a new world' Maria 0 0 1
# create a simple pivot with the data you need
df_pivot = pd.pivot_table(df,
                          values=['day', 'cake', 'world'],
                          columns=['person'],
                          aggfunc=np.sum).T
# print(df_pivot)
Final Output:
cake day world
person
Maria 0 2 1
Matt 3 0 1
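As an aside, since this is really just a per-person sum, the same table can also be produced without a pivot (using the keywords list from above):

df_pivot = df.groupby('person')[keywords].sum()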
Open to suggestions if this seems to be a good approach, especially given the volume of data. Eager to learn.
Since this primarily involves strings, I would suggest taking the computation out of Pandas: plain Python is faster than Pandas in most cases when it comes to string manipulation.
# read in data
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

# create a dictionary mapping each person to a joined string of their sentences
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k, v in zip(df.person, df.sentence):
    d[k].append(v)
d = {k: ",".join(v) for k, v in d.items()}
# search words
strings = ("cake", "world", "day")

# get the count of each word per person and collect into a dict
m = defaultdict(list)
for k, v in d.items():
    for st in strings:
        m[k].append({st: v.count(st)})
res = {k: dict(ChainMap(*v)) for k, v in m.items()}
print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
'Maria': {'day': 2, 'world': 1, 'cake': 0}}
output = pd.DataFrame(res).T
day world cake
Matt 0 1 3
Maria 2 1 0
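One caveat: v.count(st) counts plain substrings, not regex matches (so day would also match inside daydream). If the real patterns are proper regexes, the counting step can be swapped for precompiled patterns while keeping the rest of the approach; a sketch, assuming the d and strings from above:

import re

compiled = {st: re.compile(st) for st in strings}
# findall returns non-overlapping matches, as the question requires
res = {person: {st: len(compiled[st].findall(text)) for st in strings}
       for person, text in d.items()}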
Test the speeds and see which one is better; it would be useful for me and others as well.
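A minimal timing harness for that comparison (the count_* names are placeholders, assuming each approach above is wrapped in a function taking the dataframe):

import timeit

for name, fn in [('itertuples', lambda: count_itertuples(df)),
                 ('split', lambda: count_split(df)),
                 ('python-dict', lambda: count_dict(df))]:
    print(name, timeit.timeit(fn, number=10))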