
How to get the list of patterns matched by a regex when using str.contains?

I have a DataFrame df with some text in the column Match_text. I am matching Match_text against a list of terms using a regex with \b word boundaries. I get the rows I expect, but I also need to know which terms matched in each row. In this case, foo and baz are the terms that match. How can I get these terms as well?

import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo','ball','baz','apple']
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))
df[df['Match_text'].str.contains(pat)]

The output is

    Match_text
0   foo abc
3   baz 45
6   foo baz

Along with this output, I also need the matched terms: foo, baz, and foo.

One approach would be to add a new column to your data frame that contains only the matching terms, with all non-matching words removed:

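terms_regex = r'(?:{})'.format('|'.join(terms))
df['Match_terms'] = df['Match_text'].str.replace(r'\s*\b(?!' + terms_regex + r'\b)\S+\b\s*', '', regex=True)

To be clear here, the regex I am using to remove the non-matching words is:

\s*\b(?!(?:foo|ball|baz|apple)\b)\S+\b\s*

This matches any word that is not exactly one of your keywords, along with optional surrounding whitespace, and replaces it with an empty string. The \b inside the lookahead keeps words such as foobar, which merely start with a keyword, from being treated as keywords. Note that the replacement goes through str.replace with regex=True, because re.sub works on a single string and cannot be applied to a whole column.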
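A quick way to sanity-check a removal pattern of this kind is to run it over the sample texts with plain re, outside pandas. The following is only a sketch: the names remove_others and kept are made up for illustration, and it assumes a \b after the keyword alternation inside the lookahead so that only whole keywords are kept.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']

# Alternation of the keywords: (?:foo|ball|baz|apple)
terms_regex = r'(?:{})'.format('|'.join(terms))

# Assumption: the \b after the alternation is included so that only whole
# keywords are protected; without it, "foobar" would be kept because it
# starts with "foo".
remove_others = re.compile(r'\s*\b(?!' + terms_regex + r'\b)\S+\b\s*')

# Strip every non-keyword word (and its surrounding whitespace) from each text.
kept = [remove_others.sub('', text) for text in texts]

for text, k in zip(texts, kept):
    print(repr(text), '->', repr(k))
```

Rows without any keyword end up as empty strings, so they would still need to be filtered out (for example with the str.contains mask from the question).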

A bit verbose IMHO, let me know if it meets your use case:

# keep the text only for rows that match; all other rows become NaN
df['content'] = df.loc[df['Match_text'].str.contains(pat), 'Match_text']
(df
 .dropna()
 .assign(temp = lambda x: x.content.str.split())
 .explode('temp')
 .reset_index()
 .assign(present=lambda x: x.loc[x.temp.isin(terms),'temp'])
 .dropna()
 .drop(['temp','content'],axis=1)
)

   index Match_text present
0      0    foo abc     foo
2      3     baz 45     baz
4      6    foo baz     foo
5      6    foo baz     baz

Alternatively, you could use some regex:

import re

# keep only the rows that contain at least one term
M = df.loc[df['Match_text'].str.contains(pat)]

# compile the pattern
p = re.compile(pat)

# find every matching term in each row
results = [p.findall(text) for text in M.Match_text.tolist()]

# assign the results to a new column
M = M.assign(content=results)

M

  Match_text     content
0    foo abc       [foo]
3     baz 45       [baz]
6    foo baz  [foo, baz]
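If the terms might ever contain regex metacharacters, it may be worth escaping them before joining them into the pattern. Here is a minimal sketch of the same findall idea in plain Python; find_terms and matched_only are hypothetical names used only for this illustration, not part of pandas or re.

```python
import re

def find_terms(text, terms):
    """Return every whole-word occurrence of any term in text, in order."""
    # re.escape guards against terms that contain regex metacharacters.
    pattern = r'\b(?:{})\b'.format('|'.join(re.escape(t) for t in terms))
    return re.findall(pattern, text)

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'ball', 'baz', 'apple']

# Map each text to the list of terms found in it.
matches = {text: find_terms(text, terms) for text in texts}

# Keep only the texts with at least one matching term.
matched_only = {text: found for text, found in matches.items() if found}

for text, found in matched_only.items():
    print(text, '->', found)
```

Because the group in the pattern is non-capturing, findall returns the full matched words rather than group contents.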


 