
Pandas: Join with partial match (like VLOOKUP) but in a certain order

I am trying to perform an operation in Python that is very similar to VLOOKUP in Excel, but based on the first part of a string. The problem is that the first part does not have a fixed length.

Example: I have reference data for Gouna and GreenLand, but the lookup value for Gouna sometimes starts with G and other times with Gou, while the lookup values for GreenLand start with Gre.

I have the following two pandas DataFrames:

import pandas as pd

df1 = pd.DataFrame({'Abb': ['G', 'GRE', 'Gou', 'B'],
                    'FullName': ['Gouna', 'GreenLand', 'Gouna', 'Bahr']})

df2 = pd.DataFrame({'OrderNo': ['INV20561', 'INV20562', 'INV20563', 'INV20564'],
                    'AreaName': ['GRE65335', 'Gou6D654', 'Gddd654', 'B65465']})


print(df1)

   Abb   FullName
0    G      Gouna
1  GRE  GreenLand
2  Gou      Gouna
3    B       Bahr

print(df2)

    OrderNo  AreaName
0  INV20561  GRE65335
1  INV20562  Gou6D654
2  INV20563   Gddd654
3  INV20564    B65465

My desired output is:

    OrderNo     AreaName    FullName
0   INV20561    GRE65335    GreenLand
1   INV20562    Gou6D654    Gouna
2   INV20563    Gddd654     Gouna
3   INV20564    B65465      Bahr

My approach would be to sort the Abb values in df1 in descending order of length:

df1.sort_values(by="Abb", key=lambda x: x.str.len(), ascending=False)

    Abb FullName
1   GRE GreenLand
2   Gou Gouna
0   G   Gouna
3   B   Bahr

Then I would perform some sort of VLOOKUP, either with a for loop or by applying a custom function, and this is where I am stuck.

You can craft a regex to extract the country Abb, then use it as the merge key:

# we need to sort the Abb by decreasing length to ensure
# specific Abb match before more generic (e.g. Gou/GRE match before G)
regex = '|'.join(df1['Abb'].sort_values(key=lambda s: s.str.len(),
                                        ascending=False)
                 )
# 'GRE|Gou|G|B'

out = df2.merge(df1, right_on='Abb',
                left_on=df2['AreaName'].str.extract(f'^({regex})', expand=False)
                )
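
As a hedged variation on the snippet above (not part of the original answer): if an Abb could ever contain a regex metacharacter, or if some AreaName might match no Abb at all, a sketch like the following may be safer. It assumes `re.escape` for the pattern and `how='left'` so that unmatched orders are kept with a missing FullName. The extra `INV20565` / `X99999` row is made up purely to illustrate the unmatched case.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Abb': ['G', 'GRE', 'Gou', 'B'],
                    'FullName': ['Gouna', 'GreenLand', 'Gouna', 'Bahr']})

# The last row is hypothetical: it has no matching Abb in df1
df2 = pd.DataFrame({'OrderNo': ['INV20561', 'INV20562', 'INV20563',
                                'INV20564', 'INV20565'],
                    'AreaName': ['GRE65335', 'Gou6D654', 'Gddd654',
                                 'B65465', 'X99999']})

# Longest abbreviations first, so 'GRE'/'Gou' are tried before 'G';
# re.escape protects against characters such as '.' or '+' in an Abb
abbs = df1['Abb'].sort_values(key=lambda s: s.str.len(), ascending=False)
regex = '|'.join(map(re.escape, abbs))

# Extract the matching prefix (NaN when nothing matches)
prefix = df2['AreaName'].str.extract(f'^({regex})', expand=False)

# A left merge keeps every order, even those without a match
out = df2.assign(Abb=prefix).merge(df1, on='Abb', how='left')
print(out)
```

With the default inner merge, the unmatched row would be dropped silently, so which `how` to use depends on whether missing matches should be visible in the result.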

If case does not matter:

key = df1['Abb'].str.lower()
regex = '|'.join(key
                 .sort_values(key=lambda s: s.str.len(), ascending=False)
                 )
# 'gre|gou|g|b'

out = df2.merge(df1, right_on=key,
                left_on=df2['AreaName']
                        .str.lower()
                        .str.extract(f'^({regex})', expand=False)
                ).drop(columns='key_0')

Output:

    OrderNo  AreaName  Abb   FullName
0  INV20561  GRE65335  GRE  GreenLand
1  INV20562  Gou6D654  Gou      Gouna
2  INV20563   Gddd654    G      Gouna
3  INV20564    B65465    B       Bahr
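
For comparison, the loop-based idea described in the question (sort by length, then look up with a custom function) can be sketched without a regex. This is only an illustrative alternative, not the answer's method; `lookup_full_name` and `prefixes` are hypothetical names introduced here, and the match is case-sensitive.

```python
import pandas as pd

df1 = pd.DataFrame({'Abb': ['G', 'GRE', 'Gou', 'B'],
                    'FullName': ['Gouna', 'GreenLand', 'Gouna', 'Bahr']})

df2 = pd.DataFrame({'OrderNo': ['INV20561', 'INV20562', 'INV20563', 'INV20564'],
                    'AreaName': ['GRE65335', 'Gou6D654', 'Gddd654', 'B65465']})


def lookup_full_name(area, prefixes):
    """Return the FullName of the first prefix that `area` starts with."""
    for abb, full_name in prefixes:
        if area.startswith(abb):
            return full_name
    return None  # no prefix matched


# (Abb, FullName) pairs, longest Abb first so 'Gou'/'GRE' win over 'G'
ordered = df1.sort_values(by='Abb', key=lambda s: s.str.len(), ascending=False)
prefixes = list(zip(ordered['Abb'], ordered['FullName']))

out = df2.assign(
    FullName=df2['AreaName'].map(lambda a: lookup_full_name(a, prefixes))
)
print(out)
```

This runs a Python-level loop per row, so on large frames the regex-and-merge approach is likely to be faster; the loop version is mainly easier to read and to extend with custom rules.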


 