How to get the list of matched patterns when using a regex with str.contains?
I have a data frame df with some text in the column Match_text. I am matching Match_text against terms using a regex with the \b word-boundary condition. I get the expected outcome, but I also need to print which patterns matched in df. In this case, foo and baz match with \b. How do I get these terms as well?
import pandas as pd
texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45','fooz','bazzar','foo baz']
terms = ['foo','ball','baz','apple']
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))
df[df['Match_text'].str.contains(pat)]
The output is:
  Match_text
0    foo abc
3     baz 45
6    foo baz
Along with this output, I also need the matched terms: foo, baz, and foo.
One approach would be to add a new column to your current resulting data frame that contains only the matching terms, with all non-matching words removed:
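terms_regex = r'(?:{})'.format('|'.join(terms))
# work on the filtered rows and strip every word that does not start with a term
result = df[df['Match_text'].str.contains(pat)].copy()
result['Match_terms'] = result['Match_text'].str.replace(r'\s*\b(?!' + terms_regex + r')\S+\b\s*', '', regex=True)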
To be clear, the regex I am using to remove the non-matching words is:
\s*\b(?!(?:foo|ball|baz|apple))\S+\b\s*
This will match any word that does not start with one of your keywords, along with optional surrounding whitespace, and replace it with an empty string.
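To see what this removal regex does, here is a small sketch that applies it with plain `re.sub` to the three rows that `str.contains` keeps. The names `remove_pat`, `strict_pat`, `rows`, `cleaned`, `loose`, and `strict` are mine, not from the post. The second half is an assumption on my part: the lookahead as written has no trailing `\b`, so a word that merely starts with a term (such as `foobar`) is not removed, and adding `\b` inside the lookahead appears to fix that.

```python
import re

terms = ['foo', 'ball', 'baz', 'apple']
terms_regex = r'(?:{})'.format('|'.join(terms))

# The removal regex from the answer: drop words that do not start with a term.
remove_pat = r'\s*\b(?!' + terms_regex + r')\S+\b\s*'

# The three rows that the str.contains filter keeps.
rows = ['foo abc', 'baz 45', 'foo baz']
cleaned = [re.sub(remove_pat, '', text) for text in rows]
print(cleaned)  # ['foo', 'baz', 'foo baz']

# Hypothetical stricter variant: require a word boundary after the term,
# so that 'foobar' no longer counts as a keyword.
strict_pat = r'\s*\b(?!' + terms_regex + r'\b)\S+\b\s*'
loose = re.sub(remove_pat, '', 'foobar xyz')
strict = re.sub(strict_pat, '', 'foobar xyz')
print(repr(loose))   # 'foobar'
print(repr(strict))  # ''
```

On the already filtered rows both variants give the same result, so the difference only matters if you apply the substitution to the unfiltered column.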
A bit verbose IMHO, but let me know if it meets your use case:
# copy the text of the matching rows into a helper column (NaN elsewhere)
df['content'] = df.loc[df['Match_text'].str.contains(pat), 'Match_text']
(df
.dropna()
.assign(temp = lambda x: x.content.str.split())
.explode('temp')
.reset_index()
.assign(present=lambda x: x.loc[x.temp.isin(terms),'temp'])
.dropna()
.drop(['temp','content'],axis=1)
)
   index Match_text present
0      0    foo abc     foo
2      3     baz 45     baz
4      6    foo baz     foo
5      6    foo baz     baz
Alternatively, you could use the re module directly:
import re

# keep only the rows that contain at least one term
M = df.loc[df['Match_text'].str.contains(pat)]
# compile the pattern
p = re.compile(pat)
# collect every matching term in each row
results = [p.findall(text) for text in M.Match_text.tolist()]
# assign the results to a new column
M = M.assign(content=results)
M
  Match_text     content
0    foo abc       [foo]
3     baz 45       [baz]
6    foo baz  [foo, baz]
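The same result can probably be had without the explicit loop by using pandas' own `Series.str.findall`, which returns a list of matches per row. This is a minimal sketch, not taken from the answers above. The names `matches` and `result` are mine, and wrapping the terms in `re.escape` is an extra precaution I added in case a term contains regex metacharacters.

```python
import re

import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']
df = pd.DataFrame({'Match_text': texts})

# Same whole-word pattern as in the question; re.escape is an added safeguard.
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))

# One list of matched terms per row (an empty list when nothing matches).
matches = df['Match_text'].str.findall(pat)

# Keep the rows with at least one match and attach the matched terms.
result = df[matches.str.len() > 0].assign(content=matches)
print(result)
```

Because the pattern uses a non-capturing group, `findall` returns the full matched words, and `assign` aligns the lists with the filtered rows by index.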