Partial string match on both sides of the columns in pandas

import pandas as pd

d = {
    'ID': ['1', '4', '5', '9'],
    'username': ['haabi.g', 'pugal.g', 'janani.g', 'hajacob.h'],
    'email': ['abi@gmail.com', 'pugal.g@yahoo.in', 'jan232@gmail.com', 'jacob@hoi.com'],
}
df1 = pd.DataFrame(d)
print(df1)

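# For each row, collect the rows of df1 whose email starts with that row's username.
df = pd.DataFrame()
for idx, row in df1.iterrows():
    matches = df1[df1['email'].str.startswith(row['username'])]
    if not matches.empty:
        df = pd.concat([df, matches])
df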

Using the code above, I can filter all rows that partially match on the right-hand column (i.e. email => username).

Current output:

[image not available]

But I also want the reverse match (i.e. username => email), as below.

Expected output:

[image not available]

Thanks in advance.

Something like this works. The reverse task needs some minimum condition to match on; in this case, the first three characters of the email must match the start of the username.

Hopefully, this gets you started in the right direction.


import pandas as pd

d = {
    'ID': ['1', '4', '5', '9'],
    'username': ['haabi.g', 'pugal.g', 'janani.g', 'hajacob.h'],
    'email': ['abi@gmail.com', 'pugal.g@yahoo.in', 'jan232@gmail.com', 'jacob@hoi.com'],
}
df1 = pd.DataFrame(d)


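# True where the email starts with the full username.
df1['email_match'] = df1.apply(lambda x: x['email'].startswith(x['username']), axis=1)
# True where the username starts with the first three characters of the email.
df1['user_match'] = df1.apply(lambda x: x['username'].startswith(x['email'][0:3]), axis=1)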

print(df1)


  ID   username             email  email_match  user_match
0  1    haabi.g     abi@gmail.com        False       False
1  4    pugal.g  pugal.g@yahoo.in         True        True
2  5   janani.g  jan232@gmail.com        False        True
3  9  hajacob.h     jacob@hoi.com        False       False
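
Since the question asks for the matching rows rather than flag columns, the two checks above can also be combined into a boolean mask. This is a minimal sketch, assuming a row should be kept when either direction matches; the name `matched` is only illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['1', '4', '5', '9'],
    'username': ['haabi.g', 'pugal.g', 'janani.g', 'hajacob.h'],
    'email': ['abi@gmail.com', 'pugal.g@yahoo.in', 'jan232@gmail.com', 'jacob@hoi.com'],
})

# Same row-wise checks as in the answer, kept as boolean Series.
email_match = df1.apply(lambda x: x['email'].startswith(x['username']), axis=1)
user_match = df1.apply(lambda x: x['username'].startswith(x['email'][0:3]), axis=1)

# Keep rows where either check is True (assumed filtering rule).
matched = df1[email_match | user_match]
print(matched)
```

Swapping `|` for `&` would keep only rows that match in both directions.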

You can add a counting mechanism to see how many consecutive characters match.


def user_match(x):
    # Compare the part of the email before '@' with the username, character by character.
    name = x['email'].split('@')[0]
    user = x['username']
    count = 0
    for a, b in zip(name, user):
        if a != b:
            break
        count += 1
    # Report the matching prefix length only if it reaches the 3-character minimum.
    return count if count >= 3 else 0

df1['count'] = df1.apply(user_match, axis=1)


  ID   username             email  email_match  user_match  count
0  1    haabi.g     abi@gmail.com        False       False      0
1  4    pugal.g  pugal.g@yahoo.in         True        True      7
2  5   janani.g  jan232@gmail.com        False        True      3
3  9  hajacob.h     jacob@hoi.com        False       False      0
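
As an alternative to the explicit loop, the standard library's `os.path.commonprefix` compares strings character by character and returns their longest common prefix, so its length should give the same count. The sketch below relies on that; `prefix_count` and `min_len` are illustrative names, not part of the original answer.

```python
import os

import pandas as pd

df1 = pd.DataFrame({
    'ID': ['1', '4', '5', '9'],
    'username': ['haabi.g', 'pugal.g', 'janani.g', 'hajacob.h'],
    'email': ['abi@gmail.com', 'pugal.g@yahoo.in', 'jan232@gmail.com', 'jacob@hoi.com'],
})


def prefix_count(row, min_len=3):
    # Part of the email before '@'.
    local = row['email'].split('@')[0]
    # Longest shared leading substring of the local part and the username.
    shared = os.path.commonprefix([local, row['username']])
    # Below the minimum length, treat the row as not matching.
    return len(shared) if len(shared) >= min_len else 0


df1['count'] = df1.apply(prefix_count, axis=1)
print(df1[['username', 'email', 'count']])
```

The 3-character threshold mirrors the one used above; it is a parameter here so it can be tuned to the data.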
