
Pandas: join on partial string match, like Excel VLOOKUP

I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow, but they are all slightly different from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas dataframes:

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

print(df1)
    Invoice Currency
0   20561   EUR
1   20562   EUR
2   20563   EUR
3   20564   USD

print(df2)
    Ref         Type    Amount  Comment
0   20561       01      150     bla
1   INV20562    03      175     bla
2   INV20563BG  04      160     bla
3   20564       02      180     bla

Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes only a "partial match" in df2['Ref']. Joining on 'Invoice' therefore does not give the desired output, because it doesn't copy the data for invoices 20562 and 20563, see below:

df3 = df1.join(df2.set_index('Ref'), on='Invoice')

print(df3)
    Invoice Currency    Type    Amount  Comment
0   20561   EUR         01       150    bla
1   20562   EUR         NaN      NaN    NaN
2   20563   EUR         NaN      NaN    NaN
3   20564   USD         02       180    bla

Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop I get a long way, but this isn't very Pythonic.

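df4 = df1.copy()
for i, row in df1.iterrows():
    # Rows of df2 whose 'Ref' contains this invoice number
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    # Take the first match; raises IndexError if nothing matched
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]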

print(df4)
    Invoice Currency    Amount
0   20561   EUR         150
1   20562   EUR         175
2   20563   EUR         160
3   20564   USD         180

Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!

This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for; I'm not sure it exists in a vectorised form.

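df4 = df1.copy()

def get_amount(x):
    # Amount of the first df2 row whose 'Ref' contains the invoice number x.
    # str.contains treats x as a regex by default; .iloc[0] raises IndexError
    # when no row matches.
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)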

print(df4)

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
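For what it's worth, the "partial string merge" idea can also be written as a cross join followed by a substring filter. This is only a minimal sketch and not a built-in pandas feature: the helper column `_key` and the names `pairs` and `matched` are made up for illustration, and the cross join creates len(df1) * len(df2) rows, so it is only reasonable for small tables.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Cross join via a constant helper key: every invoice paired with every Ref.
# (Newer pandas versions also accept merge(..., how='cross').)
pairs = (df1.assign(_key=1)
            .merge(df2.assign(_key=1), on='_key')
            .drop(columns='_key'))

# Keep only the pairs where the invoice number occurs inside the reference.
is_match = [inv in ref for inv, ref in zip(pairs['Invoice'], pairs['Ref'])]
matched = pairs[is_match]

# Left-join back onto df1 so invoices without a match are kept with NaN.
df3 = df1.merge(matched.drop(columns='Currency'), on='Invoice', how='left')
print(df3)
```

Unlike the apply version, an invoice with no matching reference ends up with NaN instead of raising an error, and an invoice that matches several references produces one row per match.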

Here are two alternative solutions, both using Pandas' merge.

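# Solution 1 (checking directly if 'Invoice' string is in the 'Ref' string)
# Assumes df1 and df2 list the invoices in the same row order; if any pair
# does not match, the list is too short and the assignment raises ValueError.
df4 = df2.copy()
df4['Invoice'] = [val for idx, val in enumerate(df1['Invoice']) if val in df2['Ref'][idx]]
df_m4 = df1.merge(df4[['Amount', 'Invoice']], on='Invoice')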

# Solution 2 (regex)
import re
df5 = df2.copy()
df5['Invoice'] = [re.findall(r'(\d{5})', s)[0] for s in df2['Ref']]
df_m5 = df1.merge(df5[['Amount', 'Invoice']], on='Invoice')

Both df_m4 and df_m5 will print:

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180

Note: The regex solution presented assumes that the invoice numbers are always 5 digits and only takes the first such occurrence in each reference. Solution 1 compares the strings directly, so it does not depend on the number format, but it does assume that df1 and df2 list the invoices in the same row order. The regex solution could be improved to be more robust if needed.
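One possible way to make the regex variant less dependent on the 5-digit assumption is to build the pattern from the invoice numbers that actually occur in df1. This is a sketch under that assumption rather than a tested general solution; the names `invoices` and `pattern` are introduced here for illustration.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# One alternation pattern built from the known invoice numbers.
# Longer numbers go first so a shorter number is not preferred over a
# longer one starting at the same position.
invoices = sorted(df1['Invoice'], key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(inv) for inv in invoices) + ')'

df5 = df2.copy()
# str.extract returns the first match per row, or NaN when nothing matches.
df5['Invoice'] = df5['Ref'].str.extract(pattern, expand=False)

# A left merge keeps invoices that have no matching reference.
df_m5 = df1.merge(df5[['Invoice', 'Amount']], on='Invoice', how='left')
print(df_m5)
```

This does not rely on row order or on a fixed number of digits, but a reference that embeds an invoice number inside a longer number (for example a hypothetical '120561') would still count as a match.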
