
Understanding the "left_index" and "right_index" arguments in pandas merge

I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, and experimented with various settings, but I am still confused. Consider this example:

import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1, 2, 3, 4]})

Now, when I run the following command:

pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)

I get:

  key1_x key2_x    A    B key1_y key2_y    C    D    E      _merge
0     K0     K0   A0   B0     K0     K0   C0   D0  1.0        both
1     K0     K1   A1   B1     K1     K0   C1   D1  2.0        both
2     K0     K1   A1   B1     K1     K0   C2   D2  3.0        both
3     K1     K0   A2   B2    NaN    NaN  NaN  NaN  NaN   left_only
3     K2     K1   A3   B3    NaN    NaN  NaN  NaN  NaN   left_only
3    NaN    NaN  NaN  NaN     K2     K0   C3   D3  4.0  right_only

However, running the same command with right_index=True instead gives an error, and so does passing both. More interestingly, the following merge gives a very unexpected result:

pd.merge(left, right, on=['key1', 'key2'], how='outer', validate='one_to_many', indicator=True, left_index=True, right_index=True)

The result is:

  key1 key2   A   B   C   D  E _merge
0   K0   K0  A0  B0  C0  D0  1   both
1   K0   K1  A1  B1  C1  D1  2   both
2   K1   K0  A2  B2  C2  D2  3   both
3   K2   K1  A3  B3  C3  D3  4   both

As you can see, all the information in the right frame's key1 and key2 columns is completely lost.

Please help me understand the purpose and function of these arguments. Thank you.

Merging can happen in three ways:

Column-Column Merge: Use left_on, right_on and how.

Example:

# These two calls give the same answer: `on` is shorthand for passing
# the same column list as both left_on and right_on
pd.merge(left, right, left_on=['key1', 'key2'], right_on=['key1', 'key2'], how='outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer')
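
As a small sketch of the column-column case, the snippet below checks that `on` behaves like identical `left_on`/`right_on` lists, and that swapping the order of the keys on one side pairs different columns with each other. The frames are trimmed copies of the ones in the question, and the variable names are only illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3']})

# `on` and identical left_on/right_on lists describe the same join
via_on = pd.merge(left, right, on=['key1', 'key2'], how='outer')
via_left_right = pd.merge(left, right,
                          left_on=['key1', 'key2'],
                          right_on=['key1', 'key2'],
                          how='outer')
same_result = via_on.equals(via_left_right)

# Swapping the key order pairs left.key2 with right.key1 and
# left.key1 with right.key2, so both sets of key columns are kept
# and disambiguated with the _x / _y suffixes
swapped = pd.merge(left, right,
                   left_on=['key2', 'key1'],
                   right_on=['key1', 'key2'],
                   how='outer')

print(same_result)
print(list(swapped.columns))
```

The positions in left_on and right_on are matched pairwise, so the order of the two lists matters as much as their contents.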

Index-Index Merge: Set left_index and right_index to True (or pass the index level names to on), and use how.

Example:

pd.merge(left, right, how='inner', right_index=True, left_index=True)
# If you give both data frames matching, unique multi-indexes, you can instead do
# pd.merge(left, right, how='inner', on=['indexname1', 'indexname2'])
# In your data frames the keys contain duplicate values, so you can't do this
# In general, a column with duplicate values does not make a good key
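
To make the index-index case concrete, here is a minimal sketch using the question's frames with their default integer index. Because both frames are indexed 0 to 3, the merge simply pairs row 0 with row 0, row 1 with row 1, and so on, without looking at key1 or key2 at all. The name `by_position` is only illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1, 2, 3, 4]})

# Join purely on the index labels (0, 1, 2, 3 on both sides)
by_position = pd.merge(left, right, how='inner',
                       left_index=True, right_index=True)

# key1/key2 exist in both frames but were not used as join keys,
# so they come back as key1_x/key2_x (left) and key1_y/key2_y (right)
print(by_position[['key1_x', 'key1_y', 'A', 'C']])
```

Row 1 ends up with key1_x equal to K0 and key1_y equal to K1, which shows that rows were lined up by index label rather than by key value.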

Column-Index Merge: Use left_on + right_index, or left_index + right_on, together with how.

Note: The values in the index and in the left_on column must be of matching type. If your index is an integer and your left_on column is a string, you get an error. Also, the number of index levels must match the number of columns you join on.

Example:

# If how is not specified, an inner join is used
pd.merge(left, right, right_on=['E'], left_index=True, how='outer')

# Gives an error because the left_on column holds strings and the right index holds integers
pd.merge(left, right, left_on=['key1'], right_index=True, how='outer')

# This gave you an error because left_on names 2 keys but the right index only has 1 level
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
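
For a column-index merge where the types do match, here is a minimal sketch. The `lookup` frame and its `label` column are hypothetical and not part of the question; the point is only that the left column and the right index both hold strings and the right index has a single level.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'A': ['A0', 'A1', 'A2', 'A3']})

# Hypothetical lookup table whose index holds the same kind of values as left.key1
lookup = pd.DataFrame({'label': ['zero', 'one', 'two']},
                      index=['K0', 'K1', 'K2'])

# One key source per side: a column on the left, the index on the right
labelled = pd.merge(left, lookup, left_on='key1', right_index=True, how='left')

print(labelled)
```

Each left row picks up the label of the lookup row whose index value equals its key1.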

You mixed up the different types of merge, which gave weird results. If you can't see conceptually how the merge is going to happen, chances are the computer isn't going to do any better.

If I understand the behavior of merge correctly, you should pick only one option for each of left and right (i.e. you should not pass left_on=['x'] and left_index=True at the same time). Otherwise strange things can happen, because merge cannot tell which key should actually be used, as your examples show for the current implementation. (I have not checked the pandas source in detail, and the behavior may differ between versions.) Here is a small experiment.

>>> left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

>>> right
  key1 key2   C   D  E
0   K0   K0  C0  D0  1
1   K1   K0  C1  D1  2
2   K1   K0  C2  D2  3
3   K2   K0  C3  D3  4

(1) Merge using ['key1', 'key2']

>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K1   A3   B3  NaN  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3  4.0

(2) Set ['key1', 'key2'] as the left index and merge using that index and the right key columns

>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)

     A    B key1 key2    C    D    E
0   A0   B0   K0   K0   C0   D0  1.0
1   A1   B1   K0   K1  NaN  NaN  NaN
2   A2   B2   K1   K0   C1   D1  2.0
3   A2   B2   K1   K0   C2   D2  3.0
4   A3   B3   K2   K1  NaN  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3  4.0

(3) Further set ['key1', 'key2'] as the right index and merge using both indexes

>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K0  NaN  NaN   C3   D3  4.0
5   K2   K1   A3   B3  NaN  NaN  NaN

Note that (1), (2) and (3) above give the same rows (only the row and column order differs), and even if ['key1', 'key2'] is set as the index, you can still use left_on=['key1', 'key2'] instead of left_index=True.
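
The equivalence can be checked with a short sketch like the one below. The helper `rows` is hypothetical: it moves the keys back into columns if they ended up in the index, then compares the results as sorted lists of stringified rows so that differences in row order, column order and index do not matter. It assumes a pandas version in which left_on may name index levels.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1, 2, 3, 4]})
keys = ['key1', 'key2']
columns = keys + ['A', 'B', 'C', 'D', 'E']


def rows(df):
    """Return the frame's rows as a sorted list of string tuples (hypothetical helper)."""
    if 'key1' not in df.columns:
        df = df.reset_index()  # keys were kept as index levels
    return sorted(map(tuple, df[columns].astype(str).values.tolist()))


left_idx = left.set_index(keys)
right_idx = right.set_index(keys)

r1 = pd.merge(left, right, on=keys, how='outer')                                   # (1)
r2 = pd.merge(left_idx, right, left_index=True, right_on=keys, how='outer')        # (2)
r3 = pd.merge(left_idx, right_idx, left_index=True, right_index=True, how='outer') # (3)
# left_on naming the index levels instead of left_index=True
r4 = pd.merge(left_idx, right, left_on=keys, right_on=keys, how='outer')

print(rows(r1) == rows(r2) == rows(r3) == rows(r4))
```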

Now, if you really want to merge on both ['key1', 'key2'] and the index, one way to achieve this (starting from the original left and right, before the set_index calls above) is:

>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')

   index key1 key2    A    B    C    D    E
0      0   K0   K0   A0   B0   C0   D0  1.0
1      1   K0   K1   A1   B1  NaN  NaN  NaN
2      2   K1   K0   A2   B2   C2   D2  3.0
3      3   K2   K1   A3   B3  NaN  NaN  NaN
4      1   K1   K0  NaN  NaN   C1   D1  2.0
5      3   K2   K0  NaN  NaN   C3   D3  4.0

If you have read this far, I'm pretty sure you now know how to achieve the above in several different ways. Hope this helps.
