Pandas dataframe: check whether a regex stored in one column matches the string in another column of the same row

Question

The input data is a Pandas dataframe:

import pandas as pd

df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']

df

    strings patterns    group
0   apple   \ba         1
1   house   \ba         1
2   hat     \ba         1
3   train   n\b         2
4   tan     n\b         2
5   note    n\b         2

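The patterns column contains regular expressions. \b is the regex pattern that matches a word boundary. This means \ba matches 'apple', because the a is at the start of the word, whereas it does not match 'hat', because the a is in the middle of the word.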
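The word-boundary behaviour described above can be checked directly with re.search. This is a minimal sketch using the patterns and strings from the example dataframe; the names examples and results are only illustrative. Note that the raw string r"\ba" is the same pattern as the '\\ba' literal used when building the dataframe.

```python
import re

# (pattern, string) pairs taken from the example dataframe.
examples = [
    (r"\ba", "apple"),  # 'a' is at the start of the word
    (r"\ba", "hat"),    # 'a' is in the middle of the word
    (r"n\b", "train"),  # 'n' is at the end of the word
    (r"n\b", "note"),   # 'n' is at the start of the word
]

# Map each pair to True/False depending on whether the pattern is found.
results = {(p, s): bool(re.search(p, s)) for p, s in examples}

for (p, s), matched in results.items():
    print(f"{p!r} vs {s!r}: {matched}")
```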

I want to use the regex in the patterns column to check whether it matches the strings column in the same row.

Desired result:

    strings patterns    group
0   apple   \ba         1
3   train   n\b         2
4   tan     n\b         2

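I got it working below using re.search and a for loop that goes through the dataframe row by row. But this is very inefficient. I have millions of rows, and this loop takes 5-10 minutes to run.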

import re
for i in range(len(df)):
  pattern = df.at[i,"patterns"]
  test_string = df.at[i,"strings"]
  if re.search(pattern, test_string):
    df.at[i,'match'] = True
  else:
    df.at[i,'match'] = False

df.loc[df.match]

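Is there a way to do something like re.search(df['patterns'], df['strings'])?

This question seems similar: Python Pandas: Check if string in one column is contained in string of another column in the same row

But the question and answers in the link above do not use regex for matching, and I need regex to specify the word boundaries.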

Answer 1

You can't do this directly with pandas built-in methods. You will need to apply re.search to each row:

import re

mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]

Or use a (faster) list comprehension:

mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]

Output:

  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
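For completeness, here is a self-contained sketch of the list-comprehension variant run end to end on the example data. Storing the mask in a match column mirrors the question's loop and is an assumption about the desired layout, not something this answer prescribes.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "strings": ["apple", "house", "hat", "train", "tan", "note"],
    "patterns": [r"\ba", r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
    "group": ["1", "1", "1", "2", "2", "2"],
})

# One re.search call per row, pairing each pattern with its own string.
mask = [bool(re.search(p, s)) for p, s in zip(df["patterns"], df["strings"])]

# Keep the result as a column (as in the question), then filter on it.
df["match"] = mask
df2 = df[df["match"]]
print(df2)
```

The filtered frame keeps the original index labels, so the matching rows can still be traced back to the input.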

Answer 2

Compiling a regex is costly. In your example you have only a few distinct regexes, so I would try caching the compiled regexes:

import re

cache = dict()  # maps each pattern string to its compiled regex
def check(pattern, string):
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)
mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])

For your small dataframe, it takes slightly longer than @mozway's solution. But if I replicate it to 60000 rows, it saves up to 30% of the execution time.
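When there are only a few distinct patterns, a related idea is to group the rows by pattern and run one vectorized str.contains call per group. This is a different technique from the cache above, offered only as a sketch: it assumes the dataframe has a unique index, and no timing was measured for it here. The check against the plain list comprehension confirms that both approaches agree on the example data.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "strings": ["apple", "house", "hat", "train", "tan", "note"],
    "patterns": [r"\ba", r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
    "group": ["1", "1", "1", "2", "2", "2"],
})

# Start with "no match" everywhere, then fill in one pattern group at a time.
mask = pd.Series(False, index=df.index)
for pattern, row_labels in df.groupby("patterns").groups.items():
    # str.contains with regex=True searches for the pattern anywhere in each string.
    mask.loc[row_labels] = df.loc[row_labels, "strings"].str.contains(pattern, regex=True)

# Cross-check against the row-by-row re.search result.
reference = [bool(re.search(p, s)) for p, s in zip(df["patterns"], df["strings"])]
assert mask.tolist() == reference

print(df.loc[mask])
```

Whether this beats the cached version depends on how many distinct patterns the data contains, so it is worth timing on real data before adopting it.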

Solution 1
3, accepted, 2022-02-09 12:38:30

Solution 2
1, 2022-02-09 13:18:35