
pandas join DF - merge vs. join different semantics

I want to join two DataFrames in pandas. Some columns are int or float, others are categorical (the same category codes/index are not enforced for the categories in df A and df B). Their common columns are a list of 8 float and category columns.

Joining via

df_a.merge(df_b, how='inner', on=join_columns)

returns no result at all, whereas joining via

df_a.join(df_b, lsuffix='_l', rsuffix='_r')

seems to work.

But I am a bit confused about why one failed, and whether I should cast all columns to object in order to prevent joining by category codes, which could be wrong.

I.e., if left is chosen as the join method for merge, the joined columns will only contain NaN values. Unfortunately, I am not really sure how to build a useful minimal example.

Edit

Here is a sample:

import pandas as pd

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'name': ['A', 'B', 'C', 'D', 'E'],
        'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
        'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a


raw_data = {
        'subject_id': ['1', '2', '3' ],
        'name': ['Billy', 'Brian', 'Bran'],
        'nationality': ['DE', 'US', 'US'],
        'age_group' : [1, 1, 3],
        'average_return_per_group' : [1.5, 2.3, 1.4]}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group', 'average_return_per_group'])
df_b.nationality = df_b.nationality.astype('category')
df_b


# some result is joined
df_a.join(df_b, lsuffix='_l', rsuffix='_r') 

# this *fails*: only NULL values are joined, or no result at all for an inner join
df_a.merge(df_b, how='left', on=['nationality', 'age_group'])

By default, join joins along the indexes, while merge joins along the columns with the same names.

Check this:

In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
  subject_id_l name_l nationality_l  age_group_l subject_id_r name_r nationality_r  age_group_r average_returns_per_group
0            1      A            DE            1            1  Billy            DE          1.0                       NaN
1            2      B           AUT            2            2  Brian            US          1.0                       NaN
2            3      C            US            1            3   Bran            US          3.0                       NaN
3            4      D            US            3          NaN    NaN           NaN          NaN                       NaN
4            5      E            US            1          NaN    NaN           NaN          NaN                       NaN

Let's set ['a','b','c'] as the index of df_b and try to join again. You'll see only NaNs in all *_r columns:

In [116]: df_a.join(df_b.set_index(pd.Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
  subject_id_l name_l nationality_l  age_group_l subject_id_r name_r nationality_r  age_group_r average_returns_per_group
0            1      A            DE            1          NaN    NaN           NaN          NaN                       NaN
1            2      B           AUT            2          NaN    NaN           NaN          NaN                       NaN
2            3      C            US            1          NaN    NaN           NaN          NaN                       NaN
3            4      D            US            3          NaN    NaN           NaN          NaN                       NaN
4            5      E            US            1          NaN    NaN           NaN          NaN                       NaN

In [117]: df_b.set_index(pd.Index(['a','b','c']))
Out[117]:
  subject_id   name nationality  age_group average_returns_per_group
a          1  Billy          DE          1                       NaN
b          2  Brian          US          1                       NaN
c          3   Bran          US          3                       NaN
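
The index-versus-column distinction can also be sketched on a smaller pair of frames. The names left, right and key and all values below are made up for illustration; they are not the question's data. The last call shows one way to make join match on values: move the key into the index of the right frame and name the left frame's key column with on=.

```python
import pandas as pd

# Hypothetical toy frames, not the data from the question.
left = pd.DataFrame({"key": ["x", "y", "z"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["z", "x"], "b": [30, 10]})

# join aligns row labels (0, 1, 2 against 0, 1) and ignores the "key" values.
by_index = left.join(right, lsuffix="_l", rsuffix="_r")

# merge matches rows on the values of the shared "key" column.
by_column = left.merge(right, on="key", how="left")

# join can match on values too: index the right frame by the key
# and point on= at the key column of the left frame.
by_key = left.join(right.set_index("key"), on="key")

print(by_index)
print(by_column)
print(by_key)
```

In by_index, row 0 pairs "x" with "z" simply because both sit at label 0, whereas by_column and by_key pair rows that share the same key value.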

UPDATE: IMO merge works as expected (as described in the docs):

In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
  subject_id_l name_l nationality  age_group subject_id_r name_r  average_return_per_group
0            1      A          DE          1            1  Billy                       1.5
1            2      B         AUT          2          NaN    NaN                       NaN
2            3      C          US          1            2  Brian                       2.3
3            4      D          US          3            3   Bran                       1.4
4            5      E          US          1            2  Brian                       2.3
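
On the worry about category codes raised in the question: as far as I can tell, merge compares the category labels rather than the integer codes, so two categorical columns with different category sets still match by value. A minimal sketch, using made-up frames (left, right, x and y are illustrative names only):

```python
import pandas as pd

# Hypothetical toy frames; only the "nationality" labels echo the question.
left = pd.DataFrame({"nationality": ["DE", "AUT", "US"], "x": [1, 2, 3]})
right = pd.DataFrame({"nationality": ["DE", "US"], "y": [10, 20]})
left["nationality"] = left["nationality"].astype("category")
right["nationality"] = right["nationality"].astype("category")

# The same label gets a different integer code in each frame,
# because each frame has its own set of categories.
left_codes = left["nationality"].cat.codes.tolist()
right_codes = right["nationality"].cat.codes.tolist()
print(left_codes, right_codes)

# The merge still pairs "DE" with "DE" and "US" with "US".
merged = left.merge(right, on="nationality", how="left")
print(merged)
```

If this holds for your data, casting everything to object should not be needed for correctness, although the key column of the result may no longer be categorical when the category sets differ.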

I think the main difference is that join defaults to a left join, while merge defaults to an inner join.
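
The differing defaults can be seen by counting rows. This is only a sketch on invented frames (left, right and key are hypothetical names); it relies on the documented defaults how='inner' for merge and how='left' for join.

```python
import pandas as pd

# Hypothetical toy frames: "y" has no partner in the right frame.
left = pd.DataFrame({"key": ["x", "y", "z"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["x", "z"], "b": [10, 30]})

# merge without how= is an inner join: the unmatched "y" row is dropped.
merged_default = left.merge(right, on="key")

# join without how= is a left join: "y" is kept and b is NaN for it.
joined_default = left.join(right.set_index("key"), on="key")

# Passing how="left" to merge keeps the unmatched row as well.
merged_left = left.merge(right, on="key", how="left")

print(len(merged_default), len(joined_default), len(merged_left))
```

So an inner merge that comes back empty means no key combination matched at all, while a default join will always return every row of the left frame.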
