Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row
Input data is a Pandas dataframe:
import pandas as pd

df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']
df
  strings patterns group
0   apple      \ba     1
1   house      \ba     1
2     hat      \ba     1
3   train      n\b     2
4     tan      n\b     2
5    note      n\b     2
The patterns column contains regexes. \b is a regex pattern that matches on word boundaries. That means \ba would match 'apple', because the a is at the beginning of the word, while it would not match 'hat', because that a is in the middle of the word.
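A quick check of this word-boundary behavior with re.search (a small illustrative sketch, not part of the original question):

```python
import re

# \ba matches an 'a' that sits at a word boundary (start of a word)
print(bool(re.search(r'\ba', 'apple')))  # True: 'a' starts the word
print(bool(re.search(r'\ba', 'hat')))    # False: 'a' is mid-word
# n\b matches an 'n' followed by a word boundary (end of a word)
print(bool(re.search(r'n\b', 'train')))  # True: 'n' ends the word
print(bool(re.search(r'n\b', 'note')))   # False: 'n' starts the word
```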
I want to use the regex in the patterns column to check whether it matches the strings column in the same row.
Desired result:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
I got it to work below using re.search and a for loop that iterates row by row. But this is very inefficient: I have millions of rows, and this loop takes 5-10 minutes to run.
import re

for i in range(len(df)):
    pattern = df.at[i, "patterns"]
    test_string = df.at[i, "strings"]
    if re.search(pattern, test_string):
        df.at[i, 'match'] = True
    else:
        df.at[i, 'match'] = False
df.loc[df.match]
Is there a way to do something like re.search(df['patterns'], df['strings'])?
This question appears to be similar: Python Pandas: Check if string in one column is contained in string of another column in the same row
However, the question and answers in the above link do not use regex to match, and I need regex to specify word boundaries.
You can't use a pandas builtin method directly. You will need to apply a re.search per row:
import re
mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]
or using a (faster) list comprehension:
mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]
output:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
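Since each pattern here repeats across many rows, another option (a sketch not taken from the answers above, assuming the number of distinct patterns is small) is to group by the pattern and run the vectorized str.contains once per unique regex instead of once per row:

```python
import pandas as pd

df = pd.DataFrame({
    'strings': ['apple', 'house', 'hat', 'train', 'tan', 'note'],
    'patterns': ['\\ba', '\\ba', '\\ba', 'n\\b', 'n\\b', 'n\\b'],
    'group': ['1', '1', '1', '2', '2', '2'],
})

# One vectorized str.contains call per unique pattern
mask = pd.Series(False, index=df.index)
for pattern, sub in df.groupby('patterns'):
    mask[sub.index] = sub['strings'].str.contains(pattern, regex=True)
print(df[mask])
```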
Compiling a regex is costly. In your example you only have a few distinct regexes, so I would try to cache the compiled regex:
cache = dict()

def check(pattern, string):
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)
mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])
For your tiny dataframe this is slightly slower than @mozway's solution. But if I replicate the data up to 60000 rows, it saves up to 30% of the execution time.
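The manual dictionary cache above can also be written with functools.lru_cache (an equivalent sketch, not from the original answer; re itself keeps a small internal cache of compiled patterns, but an explicit cache avoids re-parsing the pattern argument on every call):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled(pattern):
    # Each distinct pattern string is compiled exactly once
    return re.compile(pattern)

def check(pattern, string):
    return compiled(pattern).search(string)
```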