
How to get all matches using str.contains in Python regex?

I have a data frame in which I need to find all rows that match any of the terms, along with every term each row matches. My code is:

import re
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45','fooz','bazzar','foo baz']
terms = ['foo','baz','foo baz']
# create df
df = pd.DataFrame({'Match_text': texts})
# create pattern
pat = r'\b(?:{})\b'.format('|'.join(terms))
# use str.contains to keep only the rows that match
df = df[df['Match_text'].str.contains(pat)]

# compile the pattern
p = re.compile(pat)

# collect the matches for each remaining row
results = [p.findall(text) for text in df.Match_text.tolist()]
df['results'] = results

The output is:

Match_text  results
0   foo abc     [foo]
3   baz 45      [baz]
6   foo baz     [foo, baz]

Here, row 6 also matches the term foo baz, in addition to foo and baz, but foo baz is missing from its results. I need to get, for each row, all matches that are in terms.

The longer alternatives should come before the shorter ones, so you need to sort the keywords by length in descending order:

pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))

The resulting pattern is \b(?:foo baz|foo|baz)\b. At each position the engine first tries to match foo baz, then foo, then baz. If foo baz is found, that match is returned and the search for the next match resumes from the end of it, so the foo and baz inside the previous match are not matched again.

See more on this in "Remember That The Regex Engine Is Eager".
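The effect of alternative order can be checked directly with re.findall. This is a minimal sketch using only the terms from the question; the variable names unsorted_pat and sorted_pat are chosen here for illustration.

```python
import re

terms = ['foo', 'baz', 'foo baz']

# Alternatives in the original order: 'foo' is tried before 'foo baz'
unsorted_pat = r'\b(?:{})\b'.format('|'.join(terms))

# Longest alternatives first, as suggested above
sorted_pat = r'\b(?:{})\b'.format(
    '|'.join(sorted(terms, key=len, reverse=True)))

text = 'foo baz'
unsorted_matches = re.findall(unsorted_pat, text)
sorted_matches = re.findall(sorted_pat, text)

print(unsorted_matches)  # ['foo', 'baz']
print(sorted_matches)    # ['foo baz']
```

Note that with the sorted pattern the row yields only foo baz, not foo and baz as well, because findall returns non-overlapping matches. If the terms could contain regex metacharacters, passing each one through re.escape before joining would be a sensible precaution.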

Instead of using the regex pattern to collect the matching terms:

#create pattern
p = re.compile(pat)

#search for pattern in the column
results = [p.findall(text) for text in df.Match_text.tolist()]

try a simple lookup of each term in the text, like this:

#search for each term in the column
results = [[term for term in terms if term in text] for text in df.Match_text.tolist()]

The output for the above looks like this:

    Match_text  results
0   foo abc     [foo]
3   baz 45      [baz]
6   foo baz     [foo, baz, foo baz]

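NOTE: This method checks every term against every text, so its running time grows with the number of texts multiplied by the number of terms. The in operator is also a plain substring test that ignores word boundaries; it gives the output above only because the rows were already filtered with str.contains.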
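If word boundaries matter and overlapping terms should all be reported, one option is to test each term with its own compiled pattern. The following is a sketch under that assumption, not part of either answer above; term_patterns and find_terms are hypothetical names introduced here.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45',
         'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']

# One word-bounded pattern per term; re.escape guards against
# regex metacharacters appearing inside a term
term_patterns = {
    term: re.compile(r'\b{}\b'.format(re.escape(term))) for term in terms
}

def find_terms(text):
    """Return every term that occurs in text as a whole word or phrase."""
    return [term for term, p in term_patterns.items() if p.search(text)]

# Plain substring test vs. word-bounded test on a non-matching row
substring_hit = 'foo' in 'foobar xyz'      # True
bounded_hits = find_terms('foobar xyz')    # []

all_results = {text: find_terms(text) for text in texts}
print(all_results['foo baz'])  # ['foo', 'baz', 'foo baz']
```

On the data frame from the question, the same function could be applied with df['Match_text'].apply(find_terms), followed by dropping the rows whose result list is empty. The cost is still one search per term per text.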
