pandas join DF - merge vs. join different semantics
I want to join two DataFrames in pandas. Some columns are int or float, others are categorical (the category codes/index are not forced to be the same for the categories in df_a and df_b). Their common columns are a list of 8 float and category columns.
Joining via
df_a.merge(df_b, how='inner', on=join_columns)
returns no rows at all, whereas joining via
df_a.join(df_b, lsuffix='_l', rsuffix='_r')
seems to work.
But I am a bit confused about why one failed, and whether I should cast all columns to object to prevent joining by category codes, which could be wrong. That is, if left is chosen as the join method for merge, the joined columns contain only NaN values. Unfortunately, I am not sure how to build a useful minimal example.
Here is a sample:
import pandas as pd
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a
raw_data = {
'subject_id': ['1', '2', '3' ],
'name': ['Billy', 'Brian', 'Bran'],
'nationality': ['DE', 'US', 'US'],
'age_group' : [1, 1, 3],
'average_return_per_group' : [1.5, 2.3, 1.4]}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group', 'average_return_per_group'])
df_b.nationality = df_b.nationality.astype('category')
df_b
# some result is joined
df_a.join(df_b, lsuffix='_l', rsuffix='_r')
# this *fails*: only NULL values are joined, or no result at all for an inner join
df_a.merge(df_b, how='left', on=['nationality', 'age_group'])
By default, join joins along the indexes, while merge joins along the columns that share the same names.
Check this:
In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
   subject_id_l  name_l  nationality_l  age_group_l  subject_id_r  name_r  nationality_r  age_group_r  average_returns_per_group
0             1       A             DE            1             1   Billy             DE          1.0                        NaN
1             2       B            AUT            2             2   Brian             US          1.0                        NaN
2             3       C             US            1             3    Bran             US          3.0                        NaN
3             4       D             US            3           NaN     NaN            NaN          NaN                        NaN
4             5       E             US            1           NaN     NaN            NaN          NaN                        NaN
Let's set ['a','b','c'] as the index of df_b and try the join again. You'll see only NaNs in all *_r columns:
In [116]: df_a.join(df_b.set_index(pd.Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
   subject_id_l  name_l  nationality_l  age_group_l  subject_id_r  name_r  nationality_r  age_group_r  average_returns_per_group
0             1       A             DE            1           NaN     NaN            NaN          NaN                        NaN
1             2       B            AUT            2           NaN     NaN            NaN          NaN                        NaN
2             3       C             US            1           NaN     NaN            NaN          NaN                        NaN
3             4       D             US            3           NaN     NaN            NaN          NaN                        NaN
4             5       E             US            1           NaN     NaN            NaN          NaN                        NaN
In [117]: df_b.set_index(pd.Index(['a','b','c']))
Out[117]:
   subject_id   name  nationality  age_group  average_returns_per_group
a           1  Billy           DE          1                        NaN
b           2  Brian           US          1                        NaN
c           3   Bran           US          3                        NaN
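To make the index-versus-column distinction concrete, here is a minimal sketch on two small made-up frames (left, right and the column names key, a, b are illustrative, not taken from the question). It assumes that a plain join behaves like a left merge on both indexes, which is how I understand the documented behaviour:

```python
import pandas as pd

# Hypothetical toy frames; both carry a default RangeIndex.
left = pd.DataFrame({"key": ["x", "y", "z"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["x", "z"], "b": [10.0, 30.0]})

# join aligns on the index labels (0, 1, 2 vs. 0, 1) and ignores the "key" values.
by_index = left.join(right, lsuffix="_l", rsuffix="_r")

# The same alignment written as an index-on-index left merge.
by_index_merge = left.merge(
    right, how="left", left_index=True, right_index=True, suffixes=("_l", "_r")
)

# merge with on= aligns on the values of the "key" column instead.
by_column = left.merge(right, how="left", on="key")

print(by_index)
print(by_column)
```

In by_index, row 1 pairs key_l 'y' with key_r 'z' simply because both sit at index label 1. That is the same effect as in the join output above, where 'AUT' ends up next to 'US'.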
UPDATE: IMO merge works as expected (as described in the docs):
In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
   subject_id_l  name_l  nationality  age_group  subject_id_r  name_r  average_return_per_group
0             1       A           DE          1             1   Billy                       1.5
1             2       B          AUT          2           NaN     NaN                       NaN
2             3       C           US          1             2   Brian                       2.3
3             4       D           US          3             3    Bran                       1.4
4             5       E           US          1             2   Brian                       2.3
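On the worry about joining by category codes: as far as I can tell, merge matches categorical keys on their labels, not on the underlying integer codes, so casting to object beforehand should not be needed for correctness. A small sketch with made-up frames (a, b, left_val and right_val are illustrative names):

```python
import pandas as pd

# Hypothetical frames whose categorical key columns have different category sets.
a = pd.DataFrame({"nationality": ["DE", "AUT", "US"], "left_val": [1, 2, 3]})
b = pd.DataFrame({"nationality": ["US", "DE"], "right_val": [10, 20]})
a["nationality"] = a["nationality"].astype("category")
b["nationality"] = b["nationality"].astype("category")

# The same label maps to a different integer code in each frame.
codes_a = dict(zip(a["nationality"], a["nationality"].cat.codes))
codes_b = dict(zip(b["nationality"], b["nationality"].cat.codes))
print(codes_a["DE"], codes_b["DE"])

# The merge still pairs rows by label: DE with 20, US with 10, AUT with nothing.
merged = a.merge(b, how="left", on="nationality")
print(merged)
```

When the category sets differ like this, the key column in the result may no longer be categorical, so check its dtype afterwards if you rely on it.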
I think the main difference is that join defaults to a left join, while merge defaults to an inner join.
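A quick way to see the different defaults is to run both on the same index alignment. This is a minimal sketch with made-up frames (left, right and the labels r1 to r3 are illustrative):

```python
import pandas as pd

# Hypothetical frames sharing two of three index labels.
left = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])
right = pd.DataFrame({"b": [10, 30]}, index=["r1", "r3"])

# join defaults to how="left": all three left rows are kept, r2 gets NaN in "b".
joined = left.join(right)

# merge defaults to how="inner": only the labels present in both frames survive.
merged = left.merge(right, left_index=True, right_index=True)

print(len(joined), len(merged))
```

Passing how= explicitly to either method removes the ambiguity.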