Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row
Input data is a Pandas dataframe:
import pandas as pd

df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']
df
  strings patterns group
0   apple      \ba     1
1   house      \ba     1
2     hat      \ba     1
3   train      n\b     2
4     tan      n\b     2
5    note      n\b     2
The patterns column contains regexes. \b is a regex pattern that matches on word boundaries. That means \ba would match 'apple', because the a is at the beginning of the word, while it would not match 'hat', because that a is in the middle of the word.
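A quick check of this word-boundary behavior with re.search (a small illustrative sketch, not part of the original question):

```python
import re

# \ba matches an 'a' that sits at a word boundary (start of a word)
print(bool(re.search(r'\ba', 'apple')))  # True: 'a' starts the word
print(bool(re.search(r'\ba', 'hat')))    # False: 'a' is mid-word
# n\b matches an 'n' followed by a word boundary (end of a word)
print(bool(re.search(r'n\b', 'train')))  # True: 'n' ends the word
print(bool(re.search(r'n\b', 'note')))   # False: 'n' starts the word
```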
I want to use the regex in the patterns column to check whether it matches the strings column in the same row.
Desired result:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
I got it to work below using re.search and a for loop that iterates row by row. But this is very inefficient: I have millions of rows, and this loop takes 5-10 minutes to run.
import re

for i in range(len(df)):
    pattern = df.at[i, "patterns"]
    test_string = df.at[i, "strings"]
    if re.search(pattern, test_string):
        df.at[i, 'match'] = True
    else:
        df.at[i, 'match'] = False
df.loc[df.match]
Is there a way to do something like re.search(df['patterns'], df['strings'])?
This question appears to be similar: Python Pandas: Check if string in one column is contained in string of another column in the same row
However, the question and answers in the above link do not use regex to match, and I need regex to specify word boundaries.
You can't use a pandas builtin method directly. You will need to apply a re.search per row:
import re
mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]
or using a (faster) list comprehension:
mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]
output:
  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
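Since each pattern here repeats across many rows, another option (a sketch not taken from the answers above, assuming the number of distinct patterns is small) is to group by the pattern and run the vectorized str.contains once per unique regex instead of once per row:

```python
import pandas as pd

df = pd.DataFrame({
    'strings': ['apple', 'house', 'hat', 'train', 'tan', 'note'],
    'patterns': ['\\ba', '\\ba', '\\ba', 'n\\b', 'n\\b', 'n\\b'],
    'group': ['1', '1', '1', '2', '2', '2'],
})

# One vectorized str.contains call per unique pattern
mask = pd.Series(False, index=df.index)
for pattern, sub in df.groupby('patterns'):
    mask[sub.index] = sub['strings'].str.contains(pattern, regex=True)
print(df[mask])
```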
Compiling a regex is costly. In your example you only have a few distinct regexes, so I would try to cache the compiled regex:
cache = dict()

def check(pattern, string):
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)
mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])
For your tiny dataframe this is slightly slower than @mozway's solution. But if I replicate the data up to 60000 rows, it saves up to 30% of the execution time.
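The manual dictionary cache above can also be written with functools.lru_cache (an equivalent sketch, not from the original answer; re itself keeps a small internal cache of compiled patterns, but an explicit cache avoids re-parsing the pattern argument on every call):

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def compiled(pattern):
    # Each distinct pattern string is compiled exactly once
    return re.compile(pattern)

def check(pattern, string):
    return compiled(pattern).search(string)
```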