Input data is a Pandas dataframe:
import pandas as pd
df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']
df
  strings patterns group
0   apple      \ba     1
1   house      \ba     1
2     hat      \ba     1
3   train      n\b     2
4     tan      n\b     2
5    note      n\b     2
The patterns column contains regex. \b is a regex pattern that matches on word boundaries. That means \ba would match 'apple' because the a is at the beginning of the word, while it would not match 'hat' because this a is in the middle of the word.
I want to use the regex in the patterns column to check if it matches with the strings column in the same row.
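As a quick illustration of the word-boundary behaviour described above, here is a minimal sketch using only the standard re module. The variable names are made up for the example; the patterns and strings are the ones from the dataframe.

```python
import re

# \ba only matches an "a" that starts a word
starts_with_a = bool(re.search(r"\ba", "apple"))  # "a" is the first character
a_in_middle = bool(re.search(r"\ba", "hat"))      # "a" follows "h", no boundary

# n\b only matches an "n" that ends a word
ends_with_n = bool(re.search(r"n\b", "train"))    # "n" is the last character
n_at_start = bool(re.search(r"n\b", "note"))      # "n" is followed by "o"

print(starts_with_a, a_in_middle, ends_with_n, n_at_start)
```

The first and third checks succeed and the other two fail, which is why rows 0, 3 and 4 are the ones expected in the result.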
Desired result:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
I got it to work below using re.search and a for loop that loops line by line. But this is very inefficient. I have millions of rows and this loop takes 5-10 minutes to run.
import re
for i in range(len(df)):
    pattern = df.at[i,"patterns"]
    test_string = df.at[i,"strings"]
    if re.search(pattern, test_string):
        df.at[i,'match'] = True
    else:
        df.at[i,'match'] = False
df.loc[df.match]
Is there a way to do something like re.search(df['patterns'], df['strings'])?
This question appears to be similar: Python Pandas: Check if string in one column is contained in string of another column in the same row
However, the question and answers in the above link are not using regex to match, and I need to use regex to specify word boundaries.
You can't use a pandas builtin method directly. You will need to apply a re.search per row:
import re
mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]
or using a (faster) list comprehension:
mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]
output:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
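Compiling a regex is costly, and although the re module keeps its own internal cache of compiled patterns, every re.search call still pays for the lookup. In your example, you only have a few distinct regexes, so I would try to cache the compiled regex myself: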
import re
cache = dict()
def check(pattern, string):
    # compile each distinct pattern only once and reuse it afterwards
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)
mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])
On your tiny dataframe it is slightly slower than @mozway's solution. But if I replicate it up to 60000 lines, it saves up to 30% of the execution time.
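Since there are only a few distinct patterns, another option that may be worth trying is to group the rows by pattern and let pandas run each regex over its whole group with Series.str.contains. This is only a sketch and has not been timed on the real data. It assumes the dataframe index is unique and that the patterns contain no capture groups (str.contains warns about those); the names mask, sub and result are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "strings": ["apple", "house", "hat", "train", "tan", "note"],
    "patterns": [r"\ba", r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
    "group": ["1", "1", "1", "2", "2", "2"],
})

# start with "no match" everywhere, then fill in one pattern at a time
mask = pd.Series(False, index=df.index)
for pattern, sub in df.groupby("patterns")["strings"]:
    # sub holds the strings of every row that shares this pattern
    mask.loc[sub.index] = sub.str.contains(pattern, regex=True, na=False)

result = df[mask]
print(result)
```

The Python-level loop then runs once per distinct pattern instead of once per row, so any gain depends on how few distinct patterns there are compared to the number of rows.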