Pandas: Best way to join two dataframes based on a common column
I know this is a basic question, but please hear me out.
I have the following dataframes:
In [722]: m1
Out[722]:
   Person_id  Evidence_14  Feature_14
0        100         90.0        True
1        101          NaN         NaN
2        102         91.0        True
3        103          NaN         NaN
4        104         94.0        True
5        105          NaN         NaN
6        106          NaN         NaN
In [721]: m3
Out[721]:
   Person_id  Evidence_14  Feature_14
0        100          NaN         NaN
1        101         99.0       False
2        102          NaN         NaN
3        103         95.0       False
4        104          NaN         NaN
5        105          NaN         NaN
6        106         93.0       False
Expected output:
In [734]: z
Out[734]:
   Person_id  Evidence_14  Feature_14
0        100         90.0        True
1        101         99.0       False
2        102         91.0        True
3        103         95.0       False
4        104         94.0        True
5        105          NaN         NaN
6        106         93.0       False
I am able to solve this as follows:
In [725]: z = m1.merge(m3, on='Person_id')
In [728]: z['Evidence_14'] = z.Evidence_14_x.combine_first(z.Evidence_14_y)
In [731]: z['Feature_14'] = z.Feature_14_x.combine_first(z.Feature_14_y)
In [733]: z.drop(columns=['Evidence_14_x', 'Evidence_14_y', 'Feature_14_x', 'Feature_14_y'], inplace=True)
In [734]: z
Out[734]:
   Person_id  Evidence_14  Feature_14
0        100         90.0        True
1        101         99.0       False
2        102         91.0        True
3        103         95.0       False
4        104         94.0        True
5        105          NaN         NaN
6        106         93.0       False
But is there a cleaner/better way to do this? Am I missing something very obvious?
If the column names match and the rows need to be matched by Person_id values, use:
m = m1.set_index('Person_id').combine_first(m3.set_index('Person_id')).reset_index()
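As a minimal, self-contained sketch of the approach above, the frames below are rebuilt by hand from the tables in the question (the construction code is an assumption, only the values come from the post). Values from m1 are kept where present, and gaps are filled from the m3 row with the same Person_id.

```python
import numpy as np
import pandas as pd

ids = [100, 101, 102, 103, 104, 105, 106]

# Frames reconstructed from the question's printed output.
m1 = pd.DataFrame({
    "Person_id": ids,
    "Evidence_14": [90.0, np.nan, 91.0, np.nan, 94.0, np.nan, np.nan],
    "Feature_14": [True, np.nan, True, np.nan, True, np.nan, np.nan],
})
m3 = pd.DataFrame({
    "Person_id": ids,
    "Evidence_14": [np.nan, 99.0, np.nan, 95.0, np.nan, np.nan, 93.0],
    "Feature_14": [np.nan, False, np.nan, False, np.nan, np.nan, False],
})

# Align on Person_id, take m1's values first, fall back to m3 where m1 is NaN,
# then turn Person_id back into an ordinary column.
m = (
    m1.set_index("Person_id")
      .combine_first(m3.set_index("Person_id"))
      .reset_index()
)
print(m)
```

Person 105 stays NaN in both columns because neither frame has a value for it.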
If the index values and the Person_id values are the same in both DataFrames, the solution can be simplified by matching on the original index values:
m = m1.combine_first(m3)
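The shortcut is only safe when both frames really line up row by row. A possible guard, sketched here with two small hypothetical frames (`left` and `right` are illustration names, not from the post), checks the index and the Person_id column first and falls back to the set_index version otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical frames used only for illustration.
left = pd.DataFrame({"Person_id": [100, 101, 102],
                     "Evidence_14": [90.0, np.nan, 91.0]})
right = pd.DataFrame({"Person_id": [100, 101, 102],
                      "Evidence_14": [np.nan, 99.0, np.nan]})

# The plain combine_first matches rows by index label, so it is only
# correct when both the index and the key column agree.
aligned = left.index.equals(right.index) and left["Person_id"].equals(right["Person_id"])

if aligned:
    merged = left.combine_first(right)
else:
    merged = (
        left.set_index("Person_id")
            .combine_first(right.set_index("Person_id"))
            .reset_index()
    )
print(merged)
```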
Because Person_id uniquely identifies related rows in m1 and m3, you have to use set_index. Look at this:
import pandas as pd
df1 = pd.DataFrame({'id':[11, 22, 33,44],'A': [None, 0, 17, None], 'B': [None, 4, 19,None]})
df2 = pd.DataFrame({'id':[111, 222], 'A': [9999, 9999], 'B': [7777, 7777]})
# df1 = df1.set_index('id')
# df2 = df2.set_index('id')
df1.combine_first(df2)
Out[32]:
   id       A       B
0  11  9999.0  7777.0
1  22     0.0     4.0
2  33    17.0    19.0
3  44     NaN     NaN
If you don't use set_index, the first value of A will be changed even though its id is 11 in df1 and 111 in df2 (different ids).
Also note that if you use set_index, an id that does not exist in m1 will be added to the result.
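Both points can be checked with the same df1 and df2. The sketch below indexes both frames by id, shows that the combined result contains the union of ids, and then uses reindex to keep only the ids from df1. The reindex step is one possible way to drop the extra rows, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [11, 22, 33, 44],
                    "A": [None, 0, 17, None],
                    "B": [None, 4, 19, None]})
df2 = pd.DataFrame({"id": [111, 222], "A": [9999, 9999], "B": [7777, 7777]})

a = df1.set_index("id")
b = df2.set_index("id")

# Rows are now matched by id, so id 11 is not filled from id 111,
# but ids that exist only in df2 (111, 222) appear in the result.
combined = a.combine_first(b)
print(combined)

# Optional: keep only the ids that were present in df1.
only_df1 = combined.reindex(a.index).reset_index()
print(only_df1)
```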