
pandas str.contains: exact substring match not working with regex boundary

I have two dataframes, and I am trying to find a way to match an exact substring from one dataframe against the other.

First DataFrame:

import pandas as pd
import numpy as np

random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl', 'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 
              'Site':['DV360', 'Adikteev']}
        
dataframe = pd.DataFrame(random_data)
print(dataframe)

Second DataFrame:

test_data = {'code name': ['PB', 'PB', 'PB'], 
             'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
             'code':['progra', 'emo', 'prog']}

test_dataframe = pd.DataFrame(test_data)

Approach

for k, l, m in zip(test_dataframe.iloc[:, 0], test_dataframe.iloc[:, 1], test_dataframe.iloc[:, 2]):
    dataframe['Site'] = np.select([dataframe['Place Name'].str.contains(r'\b{}~{}\b'.format(k, m), regex=False)], [l],
                                  default=dataframe['Site'])

The current output is shown below. I expect the code above to match only the exact substring, but it does not.

Current Output:

Place Name                        Site
TS~HOT_MD~h_PB~progra_VV~gogl     programmatic-mechanics
FM~uiosv_PB~emo_SZ~1x1_TG~bhv     emoteev

Expected Output:

Place Name                        Site
TS~HOT_MD~h_PB~progra_VV~gogl     programmatic me
FM~uiosv_PB~emo_SZ~1x1_TG~bhv     emoteev

Data

import pandas as pd
import numpy as np

random_data = {'Place Name': ['TS~HOT_MD~h_PB~progra_VV~gogl',
                              'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
               'Site': ['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)

test_data = {'code name': ['PB', 'PB', 'PB'],
             'Actual': ['programmatic me', 'emoteev', 'programmatic-mechanics'],
             'code': ['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)

Map the test_dataframe columns code and Actual into a dictionary as keys and values respectively:

keys=test_dataframe['code'].values.tolist()

dicto=dict(zip(test_dataframe.code, test_dataframe.Actual))
dicto

Join the keys with | so that the pattern matches any one of the phrases:

k = '|'.join(r"{}".format(x) for x in dicto.keys())
k

Extract from each Place Name the first substring that matches any phrase in k, then map it through the dictionary:

dataframe['Site'] = dataframe['Place Name'].str.extract('('+ k + ')', expand=False).map(dicto)
dataframe
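
The extract-and-map step above depends on the order of the alternatives: 'progra' wins only because it appears before its prefix 'prog' in the pattern. As a minimal sketch (not part of the original answer), the variant below sorts the codes longest first, escapes them with re.escape, and keeps the existing Site value where nothing matches. The names ordered, pattern and matched are illustrative assumptions.

```python
import re
import pandas as pd

dataframe = pd.DataFrame({
    'Place Name': ['TS~HOT_MD~h_PB~progra_VV~gogl',
                   'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
    'Site': ['DV360', 'Adikteev'],
})
test_dataframe = pd.DataFrame({
    'code name': ['PB', 'PB', 'PB'],
    'Actual': ['programmatic me', 'emoteev', 'programmatic-mechanics'],
    'code': ['progra', 'emo', 'prog'],
})

# Map each code to the value that should replace Site.
dicto = dict(zip(test_dataframe['code'], test_dataframe['Actual']))

# Try longer codes first so 'progra' is preferred over its prefix 'prog';
# re.escape protects against regex metacharacters inside a code.
ordered = sorted(dicto, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(code) for code in ordered) + ')'

# Extract the first matching code per row and translate it via the dictionary.
matched = dataframe['Place Name'].str.extract(pattern, expand=False).map(dicto)

# Rows without any match keep their original Site value.
dataframe['Site'] = matched.fillna(dataframe['Site'])
print(dataframe)
```

Sorting by length is only a heuristic for prefix collisions; it does not check the delimiters around the code.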


Not the most elegant solution, but this does the trick.

Set up data

import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
                                'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}

dataframe = pd.DataFrame(random_data)

test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
             'code':['progra', 'emo', 'prog']}

test_dataframe = pd.DataFrame(test_data)

Solution

Create a column in test_dataframe with the substring to match:

test_dataframe['match_str'] = test_dataframe['code name'] + '~' + test_dataframe.code

print(test_dataframe)
  code name                  Actual    code  match_str
0        PB         programmatic me  progra  PB~progra
1        PB                 emoteev     emo     PB~emo
2        PB  programmatic-mechanics    prog    PB~prog

Define a function to apply to each row of test_dataframe:

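def match_string(row, dataframe):
    # Rows are paired by index label: row i of test_dataframe is checked
    # against row i of dataframe.
    ind = row.name
    try:
        # Access by column label; positional access such as row[-1] on a
        # labelled Series is deprecated in recent pandas versions.
        if row['match_str'] in dataframe.loc[ind, 'Place Name']:
            return row['Actual']
        else:
            return dataframe.loc[ind, 'Site']
    except KeyError:
        # More rows in test_dataframe than there are in dataframe
        return None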

# Apply match_string and assign back to dataframe
dataframe['Site'] = test_dataframe.apply(match_string, args=(dataframe,), axis=1)

Output:

                      Place Name             Site
0  TS~HOT_MD~h_PB~progra_VV~gogl  programmatic me
1  FM~uiosv_PB~emo_SZ~1x1_TG~bhv          emoteev
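
The function above compares row i of test_dataframe only with row i of dataframe, so it relies on the two frames being in the same order. A row-order-independent sketch follows, assuming that tokens in Place Name are always separated by '_'. It also shows why \b does not help here: '_' counts as a word character in regular expressions, so there is no word boundary between '_' and 'PB' or between 'progra' and '_'. The pattern therefore anchors on the delimiter explicitly. The variable names token, pattern and mask are illustrative, not from the original post.

```python
import re
import pandas as pd

dataframe = pd.DataFrame({
    'Place Name': ['TS~HOT_MD~h_PB~progra_VV~gogl',
                   'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
    'Site': ['DV360', 'Adikteev'],
})
test_dataframe = pd.DataFrame({
    'code name': ['PB', 'PB', 'PB'],
    'Actual': ['programmatic me', 'emoteev', 'programmatic-mechanics'],
    'code': ['progra', 'emo', 'prog'],
})

for name, actual, code in zip(test_dataframe['code name'],
                              test_dataframe['Actual'],
                              test_dataframe['code']):
    token = re.escape(f'{name}~{code}')
    # The token must start the string or follow '_', and must end the
    # string or be followed by '_' (assumed delimiter).
    pattern = rf'(?:^|_){token}(?:_|$)'
    mask = dataframe['Place Name'].str.contains(pattern, regex=True)
    # Overwrite Site only on rows where the whole token matched.
    dataframe.loc[mask, 'Site'] = actual

print(dataframe)
```

With this pattern 'PB~prog' no longer matches 'PB~progra', because the character after 'prog' is 'r' rather than the delimiter.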
