Python: How to merge two DataFrames using multiple columns as keys
I am looking for the equivalent of a SQL merge with a condition such as "t1.A = t2.A OR t1.B = t2.A OR t1.C = t2.A". I have two data frames: D1, with columns A, B, C, D, E, and D2. A few records of D2 can be matched through column A of D1; others are matched through its alias columns B, C, D and E.
I tried the following, but it gave me the wrong output.
sample = D1.merge(D2,left_on=[ 'A' or'B' or'C'or 'D' or E],
right_on=['A'], how='left')
Then I tried:
sample = pd.concat([
    D1.merge(D2, left_on='A', right_on='A', how='left'),
    D1.merge(D2, left_on='B', right_on='A', how='left'),
    D1.merge(D2, left_on='C', right_on='A', how='left'),
    D1.merge(D2, left_on='D', right_on='A', how='left'),
    D1.merge(D2, left_on='E', right_on='A', how='left')])
This gives me a lot of duplicates. I tried to remove them, but unfortunately it didn't work out.
dupes = (sample['A'] == sample['B']) == (sample['C'] == sample['D']) == sample['E']
sample=sample.loc[~dupes]
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I need the output, 'sample', to contain the same records as data frame D1.
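A side note on the two attempts above, as a minimal sketch rather than a fix for the join itself. In Python, `or` between non-empty strings returns the first one, so the `left_on` list collapses to a single key, and a chained `==` is evaluated as `x == y and y == z`, which calls `bool()` on a Series and raises the ValueError shown. The small `sample` frame below is hypothetical and exists only to reproduce both effects; combining the comparisons with `&` is one element-wise alternative, not necessarily the duplicate test the question needs.

```python
import pandas as pd

# 1) `or` returns the first truthy operand, so only 'A' survives
#    ('E' is quoted here; in the question it is a bare name)
left_on = ['A' or 'B' or 'C' or 'D' or 'E']

# Hypothetical miniature frame, only to reproduce the error
sample = pd.DataFrame({"A": [1, 0, 2], "B": [1, 5, 2], "C": [1, 0, 3]})

# 2) Chained comparison: Python evaluates (A == B) and (B == C),
#    which asks a whole Series for a single truth value
try:
    dupes = sample["A"] == sample["B"] == sample["C"]
except ValueError as exc:
    error_message = str(exc)

# Element-wise alternative: combine the boolean Series with &
all_equal = (sample["A"] == sample["B"]) & (sample["B"] == sample["C"])
filtered = sample.loc[~all_equal]
```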
Let's start with import itertools (we will use it), alongside the usual import pandas as pd.
I created the test DataFrames as follows:
D1 = pd.DataFrame(data=[
[ 1, 0, 0, 0, 0, 91 ],
[ 0, 2, 0, 0, 0, 92 ],
[ 0, 0, 3, 0, 0, 93 ],
[ 0, 0, 0, 4, 0, 94 ],
[ 0, 0, 0, 0, 5, 95 ],
[ 0, 6, 0, 0, 0, 96 ],
[ 0, 0, 7, 0, 0, 97 ]], columns=list('ABCDEF'))
D2 = pd.DataFrame(data=[
[ 1, 71, 89 ],
[ 2, 72, 88 ],
[ 3, 73, 87 ],
[ 4, 74, 86 ],
[ 5, 75, 85 ],
[ 8, 76, 84 ]], columns=list('AXY'))
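As you can see, each D1 row carries its key in exactly one of the columns A to E. D2 has rows for keys 1 to 5 and 8, so the D1 rows with keys 6 and 7 have no match in D2, and key 8 in D2 matches nothing in D1.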
Then let's define the join function:
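def myJoin(df1, df2):
    # Every pairing of a df1 row with a df2 row
    rows = itertools.product(df1.iterrows(), df2.iterrows())
    # Keep a pair when df2's A value occurs in any of df1's columns A..E;
    # Series.append was removed in pandas 2.0, so pd.concat joins the two rows
    df = pd.DataFrame(pd.concat([left, right.iloc[1:]])
                      for (_, left), (_, right) in rows
                      if right.A in left.loc['A':'E'].tolist())
    return df.reset_index(drop=True)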
And the only thing left to do is to call it:
myJoin(D1, D2)
The result is:
   A  B  C  D  E   F   X   Y
0  1  0  0  0  0  91  71  89
1  0  2  0  0  0  92  72  88
2  0  0  3  0  0  93  73  87
3  0  0  0  4  0  94  74  86
4  0  0  0  0  5  95  75  85
Note that the column names taken from both DataFrames should be unique, so I dropped the A column from D2 (right.iloc[1:]).
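myJoin walks through len(D1) × len(D2) row pairs in Python, which can get slow on larger frames. As a hedged alternative, the same matching can be sketched with melt and merge. The function name `merge_on_any_key` and the helper columns `row` and `key` are hypothetical names chosen for this illustration; the sketch assumes that neither frame already has columns with those names and that the two frames share no column names other than the key.

```python
import pandas as pd

D1 = pd.DataFrame(data=[
    [1, 0, 0, 0, 0, 91],
    [0, 2, 0, 0, 0, 92],
    [0, 0, 3, 0, 0, 93],
    [0, 0, 0, 4, 0, 94],
    [0, 0, 0, 0, 5, 95],
    [0, 6, 0, 0, 0, 96],
    [0, 0, 7, 0, 0, 97]], columns=list('ABCDEF'))
D2 = pd.DataFrame(data=[
    [1, 71, 89],
    [2, 72, 88],
    [3, 73, 87],
    [4, 74, 86],
    [5, 75, 85],
    [8, 76, 84]], columns=list('AXY'))

def merge_on_any_key(df1, df2, key_cols, right_key, how="inner"):
    base = df1.reset_index(drop=True)
    # Long format: one line per (df1 row number, candidate key value)
    long_keys = (base[key_cols]
                 .rename_axis("row")
                 .reset_index()
                 .melt(id_vars="row", value_name="key")[["row", "key"]]
                 .drop_duplicates())
    # Match every candidate key against df2's key column
    matches = long_keys.merge(df2.rename(columns={right_key: "key"}), on="key")
    # Attach df2's remaining columns to the df1 rows by row number
    extra = matches.drop(columns="key").set_index("row")
    return base.join(extra, how=how).reset_index(drop=True)

inner = merge_on_any_key(D1, D2, list("ABCDE"), "A")
left = merge_on_any_key(D1, D2, list("ABCDE"), "A", how="left")
```

With how="left" the unmatched D1 rows are kept and only X and Y are filled with NaN, so the columns coming from D1 keep their integer dtype.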
myJoin as defined above actually performs an inner join. If you want a left join, define another join function as:
def myJoin2(df1, df2):
    res = []
    for (_, left) in df1.iterrows():
        found = False
        for (_, right) in df2.iterrows():
            if right.A in left.loc['A':'E'].tolist():
                # pd.concat replaces Series.append, removed in pandas 2.0
                res.append(pd.concat([left, right.iloc[1:]]))
                found = True
        if not found:
            # No match in df2: keep the df1 row, X and Y become NaN
            res.append(left)
    df = pd.DataFrame(res)
    return df.reset_index(drop=True)
and call it:
myJoin2(D1, D2)
getting the result:
     A    B    C    D    E     F     X     Y
0  1.0  0.0  0.0  0.0  0.0  91.0  71.0  89.0
1  0.0  2.0  0.0  0.0  0.0  92.0  72.0  88.0
2  0.0  0.0  3.0  0.0  0.0  93.0  73.0  87.0
3  0.0  0.0  0.0  4.0  0.0  94.0  74.0  86.0
4  0.0  0.0  0.0  0.0  5.0  95.0  75.0  85.0
5  0.0  6.0  0.0  0.0  0.0  96.0   NaN   NaN
6  0.0  0.0  7.0  0.0  0.0  97.0   NaN   NaN
The downside is that int values are converted to float: NaN is itself a float value, so with the default NumPy-backed integer dtype this can't be avoided.
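If integer columns matter, one possible workaround is pandas' nullable integer dtype "Int64", which stores missing values as <NA> instead of NaN. The sketch below uses a small hypothetical frame shaped like the left-join result rather than the real output, and it assumes every non-missing value is a whole number, since the cast would otherwise fail.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the left-join result: NaN forces float64
joined = pd.DataFrame({"F": [95.0, 96.0], "X": [75.0, np.nan]})

# Cast to the nullable integer dtype; NaN becomes <NA>
restored = joined.astype("Int64")
```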