Pandas: Best way to join two DataFrames based on a common column

Question

I know this is a basic question, but please hear me out.

I have the following DataFrames:

In [722]: m1
Out[722]: 
   Person_id  Evidence_14 Feature_14
0        100         90.0       True
1        101          NaN        NaN
2        102         91.0       True
3        103          NaN        NaN
4        104         94.0       True
5        105          NaN        NaN
6        106          NaN        NaN

In [721]: m3
Out[721]: 
   Person_id  Evidence_14 Feature_14
0        100          NaN        NaN
1        101         99.0      False
2        102          NaN        NaN
3        103         95.0      False
4        104          NaN        NaN
5        105          NaN        NaN
6        106         93.0      False

Expected output:

In [734]: z
Out[734]: 
   Person_id  Evidence_14 Feature_14
0        100         90.0       True
1        101         99.0      False
2        102         91.0       True
3        103         95.0      False
4        104         94.0       True
5        105          NaN        NaN
6        106         93.0      False

I am able to solve this as follows:

In [725]: z = m1.merge(m3, on='Person_id')
In [728]: z['Evidence_14'] = z.Evidence_14_x.combine_first(z.Evidence_14_y)
In [731]: z['Feature_14'] = z.Feature_14_x.combine_first(z.Feature_14_y)
In [733]: z.drop(['Evidence_14_x', 'Evidence_14_y', 'Feature_14_x', 'Feature_14_y'], axis=1, inplace=True)

In [734]: z
Out[734]: 
   Person_id  Evidence_14 Feature_14
0        100         90.0       True
1        101         99.0      False
2        102         91.0       True
3        103         95.0      False
4        104         94.0       True
5        105          NaN        NaN
6        106         93.0      False

But is there a cleaner or better way to do this? Am I missing something very obvious?

Answer 1

If the column names match and the rows need to be matched by Person_id values, use:

m = m1.set_index('Person_id').combine_first(m3.set_index('Person_id')).reset_index()

If the index values are the same and the Person_id values are also the same in both DataFrames, the solution can be simplified by matching on the original index values:

m = m1.combine_first(m3)
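For reference, here is a minimal runnable sketch of both variants. The frames are rebuilt by hand from the printed output in the question, so the exact dtypes (for example, Feature_14 holding Python booleans mixed with NaN) are an assumption rather than something the question confirms.

```python
import numpy as np
import pandas as pd

# Frames rebuilt by hand from the question's printed output (dtypes assumed).
m1 = pd.DataFrame({
    "Person_id": [100, 101, 102, 103, 104, 105, 106],
    "Evidence_14": [90.0, np.nan, 91.0, np.nan, 94.0, np.nan, np.nan],
    "Feature_14": [True, np.nan, True, np.nan, True, np.nan, np.nan],
})
m3 = pd.DataFrame({
    "Person_id": [100, 101, 102, 103, 104, 105, 106],
    "Evidence_14": [np.nan, 99.0, np.nan, 95.0, np.nan, np.nan, 93.0],
    "Feature_14": [np.nan, False, np.nan, False, np.nan, np.nan, False],
})

# Variant 1: align rows on Person_id, fill the gaps in m1 from m3,
# then turn Person_id back into a regular column.
m = (
    m1.set_index("Person_id")
      .combine_first(m3.set_index("Person_id"))
      .reset_index()
)

# Variant 2: rely on the default integer index, which is only safe here
# because both frames list the same Person_id values in the same order.
m_simple = m1.combine_first(m3)

print(m)
```

Variant 1 is the safer default, since it does not depend on the two frames being in the same row order.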

Answer 2

Since Person_id uniquely identifies related rows in m1 and m3, you have to use set_index. Consider this example:

import pandas as pd

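df1 = pd.DataFrame({'id': [11, 22, 33, 44], 'A': [None, 0, 17, None], 'B': [None, 4, 19, None]})
df2 = pd.DataFrame({'id': [111, 222], 'A': [9999, 9999], 'B': [7777, 7777]})

# Deliberately left commented out to show the result without set_index:
# df1 = df1.set_index('id')
# df2 = df2.set_index('id')

# Rows are aligned on the default integer index (0, 1, ...), not on id
df1.combine_first(df2)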


Out[32]: 
   id       A       B
0  11  9999.0  7777.0
1  22     0.0     4.0
2  33    17.0    19.0
3  44     NaN     NaN

If you don't use set_index, the first value of A is overwritten even though its id is 11 in df1 and 111 in df2 (different ids), because combine_first aligns rows on the index rather than on the id column.

Also note that if you use set_index, any id that does not exist in m1 will be added to the result.
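To illustrate both points, here is a small sketch that reuses df1 and df2 from above with set_index applied. The names aligned and only_df1_ids are hypothetical, and the final isin filter is just one possible way to drop the extra ids if they are unwanted; it is not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [11, 22, 33, 44], "A": [None, 0, 17, None], "B": [None, 4, 19, None]})
df2 = pd.DataFrame({"id": [111, 222], "A": [9999, 9999], "B": [7777, 7777]})

# With id as the index, rows are matched by id, so id 11 is no longer
# filled from id 111. The result index is the union of both id sets.
aligned = (
    df1.set_index("id")
       .combine_first(df2.set_index("id"))
       .reset_index()
)
print(aligned)

# Optional (an assumption about what you may want): keep only the ids
# that were present in df1.
only_df1_ids = aligned[aligned["id"].isin(df1["id"])]
print(only_df1_ids)
```

Here the row for id 11 keeps its missing values, while ids 111 and 222 appear as additional rows in aligned.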

Answer 1: score 3, accepted, 2020-12-14 09:48:16

Answer 2: score 0, 2020-12-14 10:36:04
