
How to return matched keywords in pandas str.contains using the regex parameter?

This is my sample code:

import pandas as pd

df = pd.DataFrame({'A':
                       ['btcrr',
                        'You have crypto here',
                        'coinbase.com was there ',
                        'hotwalletint']
                   })

regex = r"(^|\W)(?:btc|crypto|coinbase|hotwallet)[^A-Za-z0-9]"
tagged_df = df[df['A'].str.contains(regex, na=False, regex=True, case=False)]

The output of tagged_df:

   A
1  You have crypto here
2  coinbase.com was there 

In this case, a row is returned only if it matches the regex I gave. But I want pandas to return the matched keyword. I am expecting something like this in tagged_df:

The expected output of tagged_df:

   A
1  crypto
2  coinbase.com

If pandas does not have this ability, please suggest alternatives that can solve this case.

Use pandas.Series.str.extract(). For each capture group in the regular expression, a new column is created containing the text matched by that group in each row. A non-capturing group, which starts with ?: (e.g. (?:abc)), produces no column. You can also add ?P<your_name> right after the opening parenthesis of a capture group to name the output column associated with that group:

new_df = df['A'].str.extract(r'(?:^|\W)(?P<A>btc|crypto|coinbase|hotwallet)[^A-Za-z0-9]')

Output:

>>> new_df
          A
0       NaN
1    crypto
2  coinbase
3       NaN

>>> new_df.dropna()
          A
1    crypto
2  coinbase
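One detail worth noting: the original filter used case=False, and str.extract() has no case parameter. A minimal sketch of one way to keep the match case-insensitive is to pass flags=re.IGNORECASE. The column name keyword and the capitalized 'Coinbase.com' row are assumptions made for illustration, not part of the original data:

```python
import re

import pandas as pd

# Sample data from the question; row 2 is capitalized here (an assumption)
# so that the effect of the case-insensitive flag is visible.
df = pd.DataFrame({'A': ['btcrr',
                         'You have crypto here',
                         'Coinbase.com was there ',
                         'hotwalletint']})

# Same keywords as above; only the named group is captured.
# The group name "keyword" is a hypothetical choice.
pattern = r'(?:^|\W)(?P<keyword>btc|crypto|coinbase|hotwallet)[^A-Za-z0-9]'

# flags=re.IGNORECASE plays the role of case=False in str.contains.
matches = df['A'].str.extract(pattern, flags=re.IGNORECASE)

# Keep only the rows where a keyword was found.
tagged_df = matches.dropna()
print(tagged_df)
```

The extracted value keeps the casing of the source text, so chain .str.lower() on the column if normalized keywords are needed.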


 