
Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row

Input data is a Pandas dataframe:

df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']

df

    strings patterns    group
0   apple   \ba         1
1   house   \ba         1
2   hat     \ba         1
3   train   n\b         2
4   tan     n\b         2
5   note    n\b         2

The patterns column contains regexes. \b matches a word boundary, so \ba matches 'apple' because the a is at the beginning of the word, while it does not match 'hat' because there the a is in the middle of the word.
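To illustrate, the word-boundary behaviour can be checked directly with re.search on the example values:

```python
import re

# \ba matches an 'a' at the start of a word
assert re.search(r'\ba', 'apple') is not None  # 'a' begins the word
assert re.search(r'\ba', 'hat') is None        # 'a' is mid-word

# n\b matches an 'n' at the end of a word
assert re.search(r'n\b', 'train') is not None  # 'n' ends the word
assert re.search(r'n\b', 'note') is None       # 'n' is followed by a letter
```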

I want to use the regex in the patterns column to check if it matches with the strings column in the same row.

Desired result:

    strings patterns    group
0   apple   \ba         1
3   train   n\b         2
4   tan     n\b         2

I got it to work below using re.search in a loop that goes row by row. But this is very inefficient: I have millions of rows, and the loop takes 5-10 minutes to run.

import re
for i in range(len(df)):
  pattern = df.at[i,"patterns"]
  test_string = df.at[i,"strings"]
  if re.search(pattern, test_string):
    df.at[i,'match'] = True
  else:
    df.at[i,'match'] = False

df.loc[df.match]

Is there a way to do something like re.search(df['patterns'], df['strings']) ?

This question appears to be similar: Python Pandas: Check if string in one column is contained in string of another column in the same row

However, the question and answers in the above link are not using regex to match, and I need to use regex to specify word boundaries.

You can't do this with a pandas built-in method directly; you will need to apply re.search per row:

import re

mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]

or, faster, using a list comprehension:

mask = [bool(re.search(p,s)) for p,s in zip(df['patterns'], df['strings'])]

output:

  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2

Compiling a regex is costly. In your example you only have a few distinct regexes, so I would cache the compiled patterns:

import re

cache = {}

def check(pattern, string):
    # Compile each distinct pattern only once and reuse it.
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)

mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])

For your tiny dataframe it is slightly slower than @mozway's solution, but if I replicate the data up to 60,000 rows it saves up to 30% of the execution time.
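When the number of distinct patterns is small, another option (a sketch, not benchmarked against the answers above) is to group the rows by pattern and call Series.str.contains once per distinct regex, so each pattern is compiled once and applied to a whole column slice at a time:

```python
import pandas as pd

df = pd.DataFrame({
    'strings':  ['apple', 'house', 'hat', 'train', 'tan', 'note'],
    'patterns': ['\\ba', '\\ba', '\\ba', 'n\\b', 'n\\b', 'n\\b'],
    'group':    ['1', '1', '1', '2', '2', '2'],
})

# Build the mask one distinct pattern at a time: a single vectorized
# str.contains call per regex, written back to the matching row positions.
mask = pd.Series(False, index=df.index)
for pattern, sub in df.groupby('patterns'):
    mask.loc[sub.index] = sub['strings'].str.contains(pattern, regex=True)

print(df[mask])
```

This trades the per-row Python call for one pass over the column per distinct pattern, which helps when millions of rows share a handful of regexes.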
