Why does .str.contains() not find partial matches here? (Pandas dataframe)

The Pandas dataframe "df1" has a column ("Receiver") with string values.

df1
    Receiver
44  BANK
106 restaurant
149 Tax office
63  house
55  car insurance

I want to go through each row of that column, check whether it matches any of the values (mostly one- or two-word search terms) in another dataframe ("df2"), and return the title of the matching column on the corresponding rows. I'm trying to do it with the following function:

df1.Receiver.apply(lambda x:
                   ''.join([i for i in df2.columns
                            if df2.loc[:, i].str.contains(x).any()])
                   )

Problem: However, this only works for values in df1's "Receiver" column that consist of just one word (so "BANK", "restaurant" and "house" work in this case).

Values with two or more words do not work ("Tax office" and "car insurance" in this case).

Isn't str.contains() supposed to find partial matches as well? How can I also find partial matches for values in the "Receiver" column that have two or more words?

Edit: here's what df2 looks like. It has the different categories as column titles, and each column holds the search terms as values:

df2
    Banks    Restaurants   Car           House
0   BANK     restaurant    car           house
1   bank     mcdonalds     
2            Subway                 

Here is the whole problem in a single image. The output can be seen on the right: the categories "Car" and "Tax office" are not found because the receivers "car insurance" and "Tax office" (Receiver column in df1) are only partial matches with the search terms "car" and "Tax" (values in df2's columns "Car" and "Tax office").

Instead of iterating over your dataframe's rows, you can iterate over the columns of df2 and use regex with pd.Series.str.contains:

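import pandas as pd

# df2 rebuilt from the table in the question; the blank cells are assumed
# to be missing values (None/NaN), which is what dropna() below relies on
df2 = pd.DataFrame({'Banks': ['BANK', 'bank', None],
                    'Restaurants': ['restaurant', 'mcdonalds', 'Subway'],
                    'Car': ['car', None, None],
                    'House': ['house', None, None]})

df1 = pd.DataFrame({'Receiver': ['BANK', 'restaurant house', 'Tax office', 'mcdonalds car']})

df1['Receiver_new'] = ''
for col in df2:
    # Join the column's search terms into one regex alternation, e.g. 'BANK|bank'
    values = '|'.join(df2[col].dropna())
    # True for each receiver that contains at least one of the search terms
    bool_series = df1['Receiver'].str.contains(values)
    # Append the category name to every matching row
    df1.loc[bool_series, 'Receiver_new'] += f'{col}|'

print(df1)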

#            Receiver        Receiver_new
# 0              BANK              Banks|
# 1  restaurant house  Restaurants|House|
# 2        Tax office                    
# 3     mcdonalds car    Restaurants|Car|
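
Why the lambda in the question misses multi-word receivers comes down to the direction of the test: Series.str.contains(pat) checks whether each element of the Series contains pat, so df2[col].str.contains('car insurance') asks whether the search term 'car' contains the longer string, which it does not. The sketch below illustrates this and then applies the reversed check to the question's data. The helper names build_pattern and categorize are hypothetical (not part of pandas or of the original answer), the blank cells of df2 are assumed to be missing values, and escaping with re.escape plus case-insensitive matching are optional additions on my part.

```python
import re

import pandas as pd

# Series.str.contains(pat) asks: does each ELEMENT of the Series contain pat?
search_terms = pd.Series(['car', 'Tax'])
print(search_terms.str.contains('car insurance').any())  # False

# Reversing the direction finds the partial match.
receivers = pd.Series(['car insurance', 'Tax office', 'BANK'])
print(receivers.str.contains('car').tolist())  # [True, False, False]


def build_pattern(column):
    """Hypothetical helper: join one column's search terms into a regex,
    escaping metacharacters and skipping missing or blank cells."""
    cleaned = [re.escape(str(t)) for t in column.dropna() if str(t).strip()]
    return '|'.join(cleaned)


def categorize(receiver_col, categories):
    """Hypothetical helper: for each receiver, list every category whose
    search terms occur in it (case-insensitive, an assumption)."""
    result = pd.Series('', index=receiver_col.index)
    for col in categories:
        pattern = build_pattern(categories[col])
        if not pattern:
            continue  # an empty pattern would match every row
        mask = receiver_col.str.contains(pattern, case=False, na=False)
        result.loc[mask] += f'{col}|'
    return result


# Data as shown in the question (blank df2 cells assumed to be missing).
df1 = pd.DataFrame(
    {'Receiver': ['BANK', 'restaurant', 'Tax office', 'house', 'car insurance']},
    index=[44, 106, 149, 63, 55])
df2 = pd.DataFrame({'Banks': ['BANK', 'bank', None],
                    'Restaurants': ['restaurant', 'mcdonalds', 'Subway'],
                    'Car': ['car', None, None],
                    'House': ['house', None, None]})

df1['Category'] = categorize(df1['Receiver'], df2)
print(df1)
```

With this data, "car insurance" is labelled "Car|", while "Tax office" stays empty because the df2 shown in the question has no column containing the term "Tax".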
