
Pandas dataframe: Check if regex contained in a column matches a string in another column in the same row

Input data is a Pandas dataframe:

import pandas as pd

df = pd.DataFrame()
df['strings'] = ['apple','house','hat','train','tan','note']
df['patterns'] = ['\\ba','\\ba','\\ba','n\\b','n\\b','n\\b']
df['group'] = ['1','1','1','2','2','2']

df

    strings patterns    group
0   apple   \ba         1
1   house   \ba         1
2   hat     \ba         1
3   train   n\b         2
4   tan     n\b         2
5   note    n\b         2

The patterns column contains regexes. \b is a regex anchor that matches at word boundaries. That means \ba matches 'apple' because the a is at the beginning of the word, but it does not match 'hat' because there the a is in the middle of the word.
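To make the word-boundary behaviour concrete, here is a minimal sketch using only the standard `re` module. The names `starts_with_a`, `ends_with_n` and `results` are illustrative and not part of the original question.

```python
import re

# \b matches the empty position between a word character and a
# non-word character, or at the start/end of the string.
starts_with_a = re.compile(r"\ba")  # an 'a' that begins a word
ends_with_n = re.compile(r"n\b")    # an 'n' that ends a word

results = {
    "apple": bool(starts_with_a.search("apple")),  # 'a' starts the word
    "hat": bool(starts_with_a.search("hat")),      # 'a' is mid-word
    "train": bool(ends_with_n.search("train")),    # 'n' ends the word
    "note": bool(ends_with_n.search("note")),      # 'n' starts the word
}
print(results)
# {'apple': True, 'hat': False, 'train': True, 'note': False}
```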

I want to use the regex in the patterns column to check whether it matches the strings column in the same row.

Desired result:

    strings patterns    group
0   apple   \ba         1
3   train   n\b         2
4   tan     n\b         2

I got it to work with the code below, using re.search and a for loop that goes row by row. But this is very inefficient: I have millions of rows and this loop takes 5-10 minutes to run.

import re
for i in range(len(df)):
  pattern = df.at[i,"patterns"]
  test_string = df.at[i,"strings"]
  if re.search(pattern, test_string):
    df.at[i,'match'] = True
  else:
    df.at[i,'match'] = False

df.loc[df.match]

Is there a way to do something like re.search(df['patterns'], df['strings'])?

This question appears to be similar: Python Pandas: Check if string in one column is contained in string of another column in the same row

However, the question and answers in the above link do not use regex to match, and I need regex to specify word boundaries.

You can't use a pandas builtin method directly. You will need to apply re.search per row:

import re

mask = df.apply(lambda r: bool(re.search(r['patterns'], r['strings'])), axis=1)
df2 = df[mask]

or using a (faster) list comprehension:

mask = [bool(re.search(p, s)) for p, s in zip(df['patterns'], df['strings'])]
df2 = df[mask]

Output:

  strings patterns group
0   apple      \ba     1
3   train      n\b     2
4     tan      n\b     2
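When the number of distinct patterns is small, as in the example, another option is to loop over the unique patterns instead of over the rows and call the vectorized `Series.str.contains` once per pattern. This is a different technique from the per-row `re.search` above, sketched here under the assumption that the patterns column holds only a handful of distinct values. With many distinct patterns it degenerates to roughly one call per row and is unlikely to help.

```python
import pandas as pd

df = pd.DataFrame({
    "strings": ["apple", "house", "hat", "train", "tan", "note"],
    "patterns": [r"\ba", r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
    "group": ["1", "1", "1", "2", "2", "2"],
})

# Start with an all-False mask and fill it in one pattern at a time.
mask = pd.Series(False, index=df.index)
for pattern, strings in df.groupby("patterns")["strings"]:
    # One vectorized regex search per distinct pattern;
    # the result is aligned back to the original rows by index.
    mask.loc[strings.index] = strings.str.contains(pattern, regex=True)

df2 = df[mask]
print(df2)
```

On the example data this selects the same rows (0, 3 and 4) as the per-row versions.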

Compiling a regex is costly. In your example you have only a few distinct regexes, so I would try to cache the compiled patterns:

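import re

# Maps each pattern string to its compiled regex object.
cache = dict()

def check(pattern, string):
    # Compile each distinct pattern only once, then reuse it.
    try:
        x = cache[pattern]
    except KeyError:
        x = re.compile(pattern)
        cache[pattern] = x
    return x.search(string)

mask = [bool(check(p, s)) for p, s in zip(df['patterns'], df['strings'])]
print(df.loc[mask])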

For your tiny dataframe it takes slightly longer than @mozway's solution. But if I replicate it up to 60000 rows, it saves up to 30% of execution time.
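A rough way to check this on your own machine is sketched below. It replicates the six example rows to 60000 rows and times both variants with `timeit`. The helper names `mask_search` and `mask_cached` are made up for this sketch, and the timings will vary with hardware and Python version, so no particular speed-up is implied.

```python
import re
import timeit

import pandas as pd

base = pd.DataFrame({
    "strings": ["apple", "house", "hat", "train", "tan", "note"],
    "patterns": [r"\ba", r"\ba", r"\ba", r"n\b", r"n\b", r"n\b"],
})
# Replicate the 6 example rows 10,000 times to get 60,000 rows.
big = pd.concat([base] * 10_000, ignore_index=True)

def mask_search(frame):
    # Plain list comprehension calling re.search with the pattern string.
    return [bool(re.search(p, s))
            for p, s in zip(frame["patterns"], frame["strings"])]

def mask_cached(frame):
    # Same result, but each distinct pattern is compiled once and reused.
    compiled = {}
    out = []
    for p, s in zip(frame["patterns"], frame["strings"]):
        rx = compiled.get(p)
        if rx is None:
            rx = compiled[p] = re.compile(p)
        out.append(bool(rx.search(s)))
    return out

t_search = timeit.timeit(lambda: mask_search(big), number=5)
t_cached = timeit.timeit(lambda: mask_cached(big), number=5)
print(f"re.search per row: {t_search:.3f}s for 5 runs")
print(f"cached compile:    {t_cached:.3f}s for 5 runs")
```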


