Pandas: join on partial string match, like Excel VLOOKUP
I am trying to perform an action in Python which is very similar to VLOOKUP in Excel. There have been many questions related to this on StackOverflow, but they are all slightly different from this use case. Hopefully someone can guide me in the right direction. I have the following two pandas dataframes:
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})
print(df1)
Invoice Currency
0 20561 EUR
1 20562 EUR
2 20563 EUR
3 20564 USD
print(df2)
Ref Type Amount Comment
0 20561 01 150 bla
1 INV20562 03 175 bla
2 INV20563BG 04 160 bla
3 20564 02 180 bla
Now I would like to create a new dataframe (df3) where I combine the two based on the invoice numbers. The problem is that the invoice numbers are not always a "full match", but sometimes a "partial match" in df2['Ref']. So joining on 'Invoice' does not give the desired output, because it doesn't copy the data for invoices 20562 and 20563, see below:
df3 = df1.join(df2.set_index('Ref'), on='Invoice')
print(df3)
Invoice Currency Type Amount Comment
0 20561 EUR 01 150 bla
1 20562 EUR NaN NaN NaN
2 20563 EUR NaN NaN NaN
3 20564 USD 02 180 bla
Is there a way to join on a partial match? I know how to "clean" df2['Ref'] with regex, but that is not the solution I am after. With a for loop I get a long way, but this isn't very Pythonic.
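df4 = df1.copy()
for i, row in df1.iterrows():
    # Rows of df2 whose 'Ref' contains this invoice number
    tmp = df2[df2['Ref'].str.contains(row['Invoice'])]
    # Take the first match (raises IndexError if there is none)
    df4.loc[i, 'Amount'] = tmp['Amount'].values[0]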
print(df4)
Invoice Currency Amount
0 20561 EUR 150
1 20562 EUR 175
2 20563 EUR 160
3 20564 USD 180
Can str.contains() somehow be used in a more elegant way? Thank you so much in advance for your help!
This is one way using pd.Series.apply, which is just a thinly veiled loop. A "partial string merge" is what you are looking for; I'm not sure it exists in a vectorised form.
df4 = df1.copy()

def get_amount(x):
    # First 'Amount' in df2 whose 'Ref' contains x (raises IndexError if none)
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)
print(df4)
Currency Invoice Amount
0 EUR 20561 150
1 EUR 20562 175
2 EUR 20563 160
3 USD 20564 180
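As a rough sketch of what a "partial string merge" could look like without a per-row lookup into df2, one option is to build the cross product of the two frames and keep only the pairs where the invoice number occurs inside the reference. This assumes pandas 1.2 or later (for how='cross') and tables small enough that len(df1) * len(df2) rows fit in memory; the names pairs, is_match, matches and df_cross are illustrative only. The substring test itself is still a Python-level comprehension, so this is not fully vectorised either.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Pair every invoice with every reference (len(df1) * len(df2) rows).
pairs = df1.merge(df2, how='cross')

# Keep only the pairs where the invoice number occurs inside the reference.
is_match = [inv in ref for inv, ref in zip(pairs['Invoice'], pairs['Ref'])]
matches = pairs[is_match]

# Left-merge back onto df1 so that invoices without any matching
# reference are kept, with NaN in the df2 columns.
df_cross = df1.merge(matches[['Invoice', 'Ref', 'Type', 'Amount', 'Comment']],
                     on='Invoice', how='left')
print(df_cross)
```

Unlike the apply version, an invoice with no matching reference ends up as a row of NaN values rather than raising an error, and an invoice that matches several references produces one row per match.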
Here are two alternative solutions, both using Pandas' merge.
# Solution 1 (checking directly if 'Invoice' string is in the 'Ref' string)
df4 = df2.copy()
df4['Invoice'] = [val for idx, val in enumerate(df1['Invoice']) if val in df2['Ref'][idx]]
df_m4 = df1.merge(df4[['Amount', 'Invoice']], on='Invoice')
# Solution 2 (regex)
import re
df5 = df2.copy()
df5['Invoice'] = [re.findall(r'(\d{5})', s)[0] for s in df2['Ref']]
df_m5 = df1.merge(df5[['Amount', 'Invoice']], on='Invoice')
Both df_m4 and df_m5 will print:
Currency Invoice Amount
0 EUR 20561 150
1 EUR 20562 175
2 EUR 20563 160
3 USD 20564 180
Note: The regex solution presented assumes that the invoice numbers are always 5 digits and only takes the first such occurrence. Solution 1 is more robust in that respect, as it directly compares the strings, but it relies on df1 and df2 listing the invoices in the same row order. The regex solution could be improved to be more robust if needed, though.
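One possible way to make the regex approach less dependent on the 5-digit assumption, sketched here rather than offered as a definitive fix, is to build the pattern from the invoice numbers that actually occur in df1 and extract with str.extract. The names invoices, pattern, df6 and df_m6 are made up for this example, and it assumes each reference contains at most one known invoice number.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'Invoice': ['20561', '20562', '20563', '20564'],
                    'Currency': ['EUR', 'EUR', 'EUR', 'USD']})
df2 = pd.DataFrame({'Ref': ['20561', 'INV20562', 'INV20563BG', '20564'],
                    'Type': ['01', '03', '04', '02'],
                    'Amount': ['150', '175', '160', '180'],
                    'Comment': ['bla', 'bla', 'bla', 'bla']})

# Build one alternation pattern from the known invoice numbers.
# Longer numbers go first so that a number which is a prefix of
# another one cannot shadow the longer match.
invoices = sorted(df1['Invoice'], key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(inv) for inv in invoices) + ')'

df6 = df2.copy()
# expand=False returns a Series; references that contain no known
# invoice number become NaN.
df6['Invoice'] = df6['Ref'].str.extract(pattern, expand=False)

# how='left' keeps every row of df1, even if no reference matched.
df_m6 = df1.merge(df6[['Invoice', 'Amount']], on='Invoice', how='left')
print(df_m6)
```

Because the extraction runs once over df2['Ref'] and the merge is an ordinary equality join, this does not depend on the row order of either frame.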