Merge DataFrames on two columns
This is a follow-up from this question.
I have two pandas DataFrames, as follows:
print( a )
    foo   bar   let letval
9  foo1  bar1  let1      a
8  foo2  bar2  let1      b
7  foo3  bar3  let1      c
6  foo1  bar1  let2      z
5  foo2  bar2  let2      y
4  foo3  bar3  let2      x
print( b )
    foo   bar   num  numval
0  foo1  bar1  num1       1
1  foo2  bar2  num1       2
2  foo3  bar3  num1       3
3  foo1  bar1  num2       4
4  foo2  bar2  num2       5
5  foo3  bar3  num2       6
I want to merge the two of them on the columns ['foo', 'bar'].
If I simply do c = pd.merge( a, b, on=['foo', 'bar'] ), I get:
print( c )
     foo   bar   let letval   num  numval
0   foo1  bar1  let1      a  num1       1
1   foo1  bar1  let1      a  num2       4
2   foo1  bar1  let2      z  num1       1
3   foo1  bar1  let2      z  num2       4
4   foo2  bar2  let1      b  num1       2
5   foo2  bar2  let1      b  num2       5
6   foo2  bar2  let2      y  num1       2
7   foo2  bar2  let2      y  num2       5
8   foo3  bar3  let1      c  num1       3
9   foo3  bar3  let1      c  num2       6
10  foo3  bar3  let2      x  num1       3
11  foo3  bar3  let2      x  num2       6
I would like:
print( c )
    foo   bar   let letval   num  numval
0  foo1  bar1  let1      a  num1       1
1  foo2  bar2  let1      b  num1       2
2  foo3  bar3  let1      c  num1       3
3  foo1  bar1  let2      z  num2       4
4  foo2  bar2  let2      y  num2       5
5  foo3  bar3  let2      x  num2       6
The closest I've got is:
c = pd.merge( a, b, left_index=['foo', 'bar'], right_index=['foo', 'bar'] )
What am I missing? And why do I get c.shape = (12,6) in the first example?
Edit
Thanks to @piRSquared's answer I realized that the underlying problem is that no combination of the shared columns identifies a row uniquely. Thus the merge problem, as originally posed, cannot be solved unambiguously. That said, the question reduces to a simpler one:
How can I establish a one-to-one relationship between the tables?
I solved that with a dictionary that maps the values that need to be aligned:
map_ab = { 'num1':'let1', 'num2':'let2' }
b['let'] = b.apply( lambda x: map_ab[x['num']], axis=1 )
c = pd.merge( a, b, on=['foo', 'bar', 'let'] )
print( c )
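For reference, a minimal self-contained sketch of the mapping approach above, with a and b rebuilt from the printed tables. It swaps the row-wise apply for Series.map, which should produce the same column here; b2 is a hypothetical working copy so that b itself stays unchanged.

```python
import pandas as pd

# Frames rebuilt from the tables printed in the question.
a = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "let": ["let1"] * 3 + ["let2"] * 3,
        "letval": list("abczyx"),
    },
    index=[9, 8, 7, 6, 5, 4],
)
b = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "num": ["num1"] * 3 + ["num2"] * 3,
        "numval": [1, 2, 3, 4, 5, 6],
    }
)

# The dictionary decides which 'num' value pairs with which 'let' value.
map_ab = {"num1": "let1", "num2": "let2"}

b2 = b.copy()                      # hypothetical working copy
b2["let"] = b2["num"].map(map_ab)  # vectorised lookup instead of apply

# With 'let' added, every (foo, bar, let) key occurs once on each side.
c = pd.merge(a, b2, on=["foo", "bar", "let"])
print(c)
```

Because each key now appears exactly once in both frames, the merge returns six rows instead of twelve.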
The reason you are getting that result is that the columns you are merging on do not form unique combinations. For example, the first row of a (position 0) has foo1 and bar1, but so does the fourth row (position 3). That alone would be fine, but b has the same issue. So when you match foo1 & bar1 from row 0 of a against b, it matches twice. The same is true for foo1 & bar1 in row 3 of a: it also matches twice. So you end up with four matches for those 2 rows.
So you get:
a row 0 matches with b row 0
a row 0 matches with b row 3
a row 3 matches with b row 0
a row 3 matches with b row 3
And then your example does this 2 more times, once each for foo2/bar2 and foo3/bar3: 3 * 4 == 12
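The count of 12 can be checked by multiplying, for each (foo, bar) key, the number of rows in a by the number of rows in b. A small sketch, with the frames rebuilt from the question's tables; the names keys and pairs_per_key are illustrative only:

```python
import pandas as pd

# Frames rebuilt from the tables printed in the question.
a = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "let": ["let1"] * 3 + ["let2"] * 3,
        "letval": list("abczyx"),
    },
    index=[9, 8, 7, 6, 5, 4],
)
b = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "num": ["num1"] * 3 + ["num2"] * 3,
        "numval": [1, 2, 3, 4, 5, 6],
    }
)

keys = ["foo", "bar"]

# Rows per key on each side; the two Series align on the (foo, bar) index.
pairs_per_key = a.groupby(keys).size() * b.groupby(keys).size()
print(pairs_per_key)

# An inner merge emits one row per matching pair, so the totals agree.
c = pd.merge(a, b, on=keys)
print(pairs_per_key.sum(), len(c))
```

Each key contributes 2 * 2 pairs, and there are three keys.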
The only way to do this unambiguously is to decide on a rule for which match to take when there is more than one. I decided to group by one of your other columns and then take the first row of each group. It still doesn't match your expected output, but I suspect the example you gave is flawed.
pd.merge( a, b, on=['foo', 'bar']).groupby(['foo', 'bar', 'let'], as_index=False).first()
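A different rule, not the one used in the line above, is to pair the n-th occurrence of each key in a with the n-th occurrence in b. The sketch below assumes row order is what defines the pairing, and occ is a hypothetical helper column name. With the question's data rebuilt from its tables, this should reproduce the desired six rows.

```python
import pandas as pd

# Frames rebuilt from the tables printed in the question.
a = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "let": ["let1"] * 3 + ["let2"] * 3,
        "letval": list("abczyx"),
    },
    index=[9, 8, 7, 6, 5, 4],
)
b = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "num": ["num1"] * 3 + ["num2"] * 3,
        "numval": [1, 2, 3, 4, 5, 6],
    }
)

keys = ["foo", "bar"]

# Number each repeat of a key: 0 for its first occurrence, 1 for the second.
a2 = a.assign(occ=a.groupby(keys).cumcount())
b2 = b.assign(occ=b.groupby(keys).cumcount())

# Merging on the keys plus the occurrence counter makes the match unique.
c = pd.merge(a2, b2, on=keys + ["occ"]).drop(columns="occ")
print(c)
```

This only gives a meaningful result when the order of rows within each key carries the intended pairing; otherwise an explicit mapping is safer.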
You can use combine_first, which aligns the two frames on their index and columns (the output below assumes a and b share the index 0 to 5):
In [21]: a.combine_first(b)
Out[21]:
    bar   foo   let letval   num  numval
0  bar1  foo1  let1      a  num1       1
1  bar2  foo2  let1      b  num1       2
2  bar3  foo3  let1      c  num1       3
3  bar1  foo1  let2      z  num2       4
4  bar2  foo2  let2      y  num2       5
5  bar3  foo3  let2      x  num2       6
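Since combine_first pairs rows by index label rather than by the foo and bar values, the result depends on how a is indexed. A sketch under that assumption, with the frames rebuilt from the question's tables (where a is labelled 9 down to 4); misaligned and aligned are illustrative names:

```python
import pandas as pd

# Frames rebuilt from the tables printed in the question.
a = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "let": ["let1"] * 3 + ["let2"] * 3,
        "letval": list("abczyx"),
    },
    index=[9, 8, 7, 6, 5, 4],
)
b = pd.DataFrame(
    {
        "foo": ["foo1", "foo2", "foo3"] * 2,
        "bar": ["bar1", "bar2", "bar3"] * 2,
        "num": ["num1"] * 3 + ["num2"] * 3,
        "numval": [1, 2, 3, 4, 5, 6],
    }
)

# With a's original labels the indexes only partly overlap, so the result
# covers the union of labels (0..9) and leaves gaps where one side is absent.
misaligned = a.combine_first(b)
print(misaligned)

# After resetting a to 0..5, row i of a lines up with row i of b.
aligned = a.reset_index(drop=True).combine_first(b)
print(aligned)
```

The aligned result is positional: it works here because both frames list their rows in the same order.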
In the first example you are doing an inner join, which returns every pair of rows from a and b whose foo and bar values are equal.
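One way to catch this early is the validate argument of pd.merge, which raises a MergeError when the keys are not unique in the way you expect. A small sketch using only the foo1/bar1 rows from the question; unique_keys is an illustrative flag:

```python
import pandas as pd
from pandas.errors import MergeError

# Only the foo1/bar1 rows from the question, enough to show the duplication.
a = pd.DataFrame(
    {"foo": ["foo1", "foo1"], "bar": ["bar1", "bar1"],
     "let": ["let1", "let2"], "letval": ["a", "z"]}
)
b = pd.DataFrame(
    {"foo": ["foo1", "foo1"], "bar": ["bar1", "bar1"],
     "num": ["num1", "num2"], "numval": [1, 4]}
)

keys = ["foo", "bar"]

# duplicated() flags the second and later rows that repeat a key.
print(a.duplicated(subset=keys).any(), b.duplicated(subset=keys).any())

try:
    # Ask pandas to verify that each key appears at most once per side.
    pd.merge(a, b, on=keys, validate="one_to_one")
    unique_keys = True
except MergeError as exc:
    unique_keys = False
    print(exc)

# Without validation the inner join silently returns every pairing.
print(len(pd.merge(a, b, on=keys)))
```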