How to get list of patterns match in regex using str.contains?

I have a data frame df with some text in the column Match_text. I am matching Match_text against terms using a regex with the \b word-boundary condition. I get the rows I expect, but I also need to print which patterns matched in df. In this case, foo and baz match with \b. How can I get these terms as well?

import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))
df[df['Match_text'].str.contains(pat)]

The output is:

    Match_text
0   foo abc
3   baz 45
6   foo baz

Along with this output, I also need the matching terms: foo, baz, and foo.

One approach would be to add a new column to your current resulting data frame that contains only the matching terms, with all other non-matching words removed:

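terms_regex = r'(?:{})'.format('|'.join(terms))
result = df[df['Match_text'].str.contains(pat)].copy()  # the filtered frame from the question
# re.sub does not accept a Series, so use the vectorized str.replace instead
result['Match_terms'] = result['Match_text'].str.replace(r'\s*\b(?!' + terms_regex + r')\S+\b\s*', '', regex=True)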

To be clear, the regex I am using to remove the non-matching words is:

\s*\b(?!(?:foo|ball|baz|apple))\S+\b\s*

This matches any word that does not begin with one of your keywords, along with optional surrounding whitespace, and replaces it with an empty string.
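To see what this removal regex does in isolation, here is a small sketch that applies it with plain re.sub to individual strings rather than to a data frame column. The names removal and matched_rows are made up for this illustration. One thing it shows: because the lookahead has no trailing \b, a word that merely starts with a keyword (such as foobar) is left in place, so the pattern is best applied to rows that have already been filtered with pat.

```python
import re

terms = ['foo', 'ball', 'baz', 'apple']
terms_regex = r'(?:{})'.format('|'.join(terms))

# Same removal pattern as in the answer: drop every word that does not
# start with one of the keywords, plus the whitespace around it.
removal = re.compile(r'\s*\b(?!' + terms_regex + r')\S+\b\s*')

# The three rows that the question's str.contains filter keeps
matched_rows = ['foo abc', 'baz 45', 'foo baz']
cleaned = [removal.sub('', text) for text in matched_rows]
print(cleaned)  # ['foo', 'baz', 'foo baz']

# Caveat: 'foobar' starts with 'foo', so the lookahead protects it
print(removal.sub('', 'foobar xyz'))  # 'foobar'
```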

A bit verbose IMHO; let me know if it meets your use case:

# keep Match_text only for the rows that match; all other rows get NaN
df['content'] = df.loc[df['Match_text'].str.contains(pat), 'Match_text']
(df
 .dropna()
 .assign(temp = lambda x: x.content.str.split())
 .explode('temp')
 .reset_index()
 .assign(present=lambda x: x.loc[x.temp.isin(terms),'temp'])
 .dropna()
 .drop(['temp','content'],axis=1)
)

   index Match_text present
0      0    foo abc     foo
2      3     baz 45     baz
4      6    foo baz     foo
5      6    foo baz     baz
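A similar one-row-per-match table can probably be built more directly with pandas' Series.str.extractall. This is only a sketch of an alternative, not what the answer above does. The names capture_pat, present and long_format are made up here, and the pattern is changed to use a capturing group because extractall requires one.

```python
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']
df = pd.DataFrame({'Match_text': texts})

# extractall needs a capturing group, so use (...) instead of (?:...)
capture_pat = r'\b({})\b'.format('|'.join(terms))

# One row per match, indexed by (original row label, match number);
# drop the match-number level so the index lines up with df again.
present = (df['Match_text'].str.extractall(capture_pat)[0]
           .rename('present')
           .reset_index(level='match', drop=True))

# An inner join keeps only the rows that had at least one match
long_format = df.join(present, how='inner')
print(long_format)
```

Unlike the split-and-isin approach, this reuses the word-boundary regex itself, so it also finds terms that sit next to punctuation rather than whitespace.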

Alternatively, you could use some regex:

import re

# keep only the rows that contain at least one term
M = df.loc[df['Match_text'].str.contains(pat)]

# compile the pattern
p = re.compile(pat)

# find every match in each row of the column
results = [p.findall(text) for text in M.Match_text.tolist()]

# assign the results to a new column
M = M.assign(content=results)

M

        Match_text  content
0        foo abc    [foo]
3        baz 45     [baz]
6        foo baz    [foo, baz]
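The same list-per-row result can likely be obtained without an explicit loop, since pandas exposes re.findall as Series.str.findall. The sketch below is a self-contained variant under that assumption. The variable found is a name introduced here, and rows without a match are dropped by checking the length of each list.

```python
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))

# str.findall runs re.findall on every element and returns one list per row
found = df['Match_text'].str.findall(pat)

# Rows without any match get an empty list, so keep only the non-empty ones
M = df.assign(content=found)[found.str.len() > 0]
print(M)
```

This does the filtering and the extraction in a single pass over the column, so str.contains is not needed separately.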
