Python Pandas: Compare values of two dataframes with different column names by ID or row
I have two dataframes (let's name them M and K) that come from different sources. They have different column names, and the only column present in both dataframes is the ID column (M[id] == K[id]).
The number of rows is the same in both dataframes; the number of columns differs.
The goal is to create a matrix that shows, for each pair of columns, how many values match for the same ID (or row). The size of the matrix (MK) is M.columns x K.columns. Each cell stores the count of matched values for the pair of M.column and K.column. The maximum number in a cell is the row count of M or K, as they are the same. Missing values (NaN) should be ignored.
Let's talk in figures =)
import numpy as np
import pandas as pd

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'm1': ['a', 'b', 'c', 'd', 'e', 2],
          'm2': [1, 2, 3, 4, np.nan, 1],
          'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
          'm4': [4, 6, 3, 4, np.nan, 2],
          'm5': ['b', 6, 'a', 4, np.nan, 1],
          }
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
          'k1': ['z', 'bb', 'c', 'd', 'e', 4],
          'k2': [1, 2, 32, 5, np.nan, 1],
          'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
          'k4': [4, 2, 2, 4, np.nan, 4],
          'k5': [4, 1, 'as', 4, np.nan, 2],
          'k6': ['aa', 1, 'a', 3, np.nan, 2],
          }
M = pd.DataFrame(data_M, columns=['id', 'm1', 'm2', 'm3', 'm4', 'm5'])
K = pd.DataFrame(data_K, columns=['id', 'k1', 'k2', 'k3', 'k4', 'k5', 'k6'])
M and K output:
M
Out[2]:
id m1 m2 m3 m4 m5
0 id1 a 1.0 aa 4.0 b
1 id2 b 2.0 b 6.0 6
2 id3 c 3.0 cc 3.0 a
3 id4 d 4.0 d 4.0 4
4 id5 e NaN ff NaN NaN
5 id6 2 1.0 3 2.0 1
K
Out[3]:
id k1 k2 k3 k4 k5 k6
0 id1 z 1.0 aa 4.0 4 aa
1 id2 bb 2.0 b 2.0 1 1
2 id3 c 32.0 cc 2.0 as a
3 id4 d 5.0 d 4.0 4 3
4 id5 e NaN ff NaN NaN NaN
5 id6 4 1.0 1 4.0 2 2
After the first comparison, for id == 'id1', the MK matrix should look something like this:
id m1 m2 m3 m4 m5
id 1 0 0 0 0 0
k1 0 0 0 0 0 0
k2 0 0 1 0 0 0
k3 0 0 0 1 0 0
k4 0 0 0 0 1 0
k5 0 0 0 0 1 0
k6 0 0 0 1 0 0
After the second one (id == 'id2'), it should look like this:
id m1 m2 m3 m4 m5
id 2 0 0 0 0 0
k1 0 0 0 0 0 0
k2 0 0 2 0 0 0
k3 0 0 0 2 0 0
k4 0 0 1 0 1 0
k5 0 0 0 0 1 0
k6 0 0 0 1 0 0
At the very end, each cell will be converted to the percentage of matched values.
One last thing. Theoretically, there could be more than one row for each ID. However, that is not the case for the current problem. But if you feel inspired, you are welcome to solve the 'general case' ^_^
Many thanks.
Approach using numpy broadcasting and pd.Panel
# Note: pd.Panel was removed in pandas 0.25, so this needs an older pandas.
m = M.values[:, 1:]
k = K.values[:, 1:]
p = pd.Panel(
    (m[:, None] == k[:, :, None]).astype(np.uint8),
    M.id.values, K.columns[1:], M.columns[1:])
Then access the result for each id:
p['id1']
m1 m2 m3 m4 m5
k1 0 0 0 0 0
k2 0 1 0 0 0
k3 0 0 1 0 0
k4 0 0 0 1 0
k5 0 0 0 1 0
k6 0 0 1 0 0
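Since pd.Panel is not available in current pandas releases, the same broadcasting idea can be sketched with a plain 3-D NumPy array. This is a minimal sketch rather than the code from the answer. It assumes both frames list the same ids in the same row order, masks NaN explicitly so that missing values are ignored as the question asks, and sums over rows to get the MK count matrix. The names eq, valid, match, per_id, counts and percent are illustrative. Dividing by the number of rows where both values are present is only one possible reading of "percentage of matched values".

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
M = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'm1': ['a', 'b', 'c', 'd', 'e', 2],
    'm2': [1, 2, 3, 4, np.nan, 1],
    'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
    'm4': [4, 6, 3, 4, np.nan, 2],
    'm5': ['b', 6, 'a', 4, np.nan, 1],
})
K = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'k1': ['z', 'bb', 'c', 'd', 'e', 4],
    'k2': [1, 2, 32, 5, np.nan, 1],
    'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
    'k4': [4, 2, 2, 4, np.nan, 4],
    'k5': [4, 1, 'as', 4, np.nan, 2],
    'k6': ['aa', 1, 'a', 3, np.nan, 2],
})

# Assumption: both frames hold the same ids in the same row order
m = M.set_index('id')
k = K.set_index('id')
mv = m.to_numpy(dtype=object)  # shape (rows, M columns)
kv = k.to_numpy(dtype=object)  # shape (rows, K columns)

# eq[i, a, b] is True when K column a equals M column b in row i
eq = kv[:, :, None] == mv[:, None, :]
# valid[i, a, b] is True when neither of the two values is NaN
valid = pd.notna(kv)[:, :, None] & pd.notna(mv)[:, None, :]
match = eq & valid

# One 0/1 frame per id, playing the role of p['id1'] above
per_id = {
    row_id: pd.DataFrame(match[i].astype(int),
                         index=k.columns, columns=m.columns)
    for i, row_id in enumerate(m.index)
}

# MK count matrix: matches summed over all rows
counts = pd.DataFrame(match.sum(axis=0),
                      index=k.columns, columns=m.columns)

# Assumption: percentage relative to rows where both values are present
percent = 100 * counts / valid.sum(axis=0)

print(per_id['id1'])
print(counts)
print(percent.round(1))
```

With the sample data, counts.loc['k3', 'm3'] is 5 (the id6 row differs), and the NaN pairs in the id5 row do not contribute to any cell.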
Or using pandas groupby
df = M.set_index('id').join(K.set_index('id'))

def row_comp(r):
    # Split the joined row back into its M and K parts by column prefix
    m = r.filter(like='m')
    k = r.filter(like='k')
    return pd.DataFrame(
        (m.values == k.values.T).astype(np.uint8),
        k.columns, m.columns
    )

df.groupby(level=0).apply(row_comp)
m1 m2 m3 m4 m5
id
id1 k1 0 0 0 0 0
k2 0 1 0 0 0
k3 0 0 1 0 0
k4 0 0 0 1 0
k5 0 0 0 1 0
k6 0 0 1 0 0
id2 k1 0 0 0 0 0
k2 0 1 0 0 0
k3 1 0 1 0 0
k4 0 1 0 0 0
k5 0 0 0 0 0
k6 0 0 0 0 0
id3 k1 1 0 0 0 0
k2 0 0 0 0 0
k3 0 0 1 0 0
k4 0 0 0 0 0
k5 0 0 0 0 0
k6 0 0 0 0 1
id4 k1 1 0 1 0 0
k2 0 0 0 0 0
k3 1 0 1 0 0
k4 0 1 0 1 1
k5 0 1 0 1 1
k6 0 0 0 0 0
id5 k1 1 0 0 0 0
k2 0 0 0 0 0
k3 0 0 1 0 0
k4 0 0 0 0 0
k5 0 0 0 0 1
k6 0 0 0 0 1
id6 k1 0 0 0 0 0
k2 0 1 0 0 1
k3 0 1 0 0 1
k4 0 0 0 0 0
k5 1 0 0 1 0
k6 1 0 0 1 0
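For the "general case" raised in the question, where an id may appear on more than one row, one option is to pair the rows with a merge first and then count matches per column pair. The following is a hedged sketch under assumptions, not part of the original answer. The small frames are made-up illustration data, every M row is compared with every K row sharing its id (an inner merge), and the names pairs, m_cols, k_cols and counts are illustrative. It also masks NaN explicitly, since the id5 block above shows NaN pairs such as k5/m5 counted as matches.

```python
import numpy as np
import pandas as pd

# Made-up illustration data: id1 is duplicated in M, id2 in K
M = pd.DataFrame({
    'id': ['id1', 'id1', 'id2'],
    'm1': ['a', 'b', 'c'],
    'm2': [1, 2, np.nan],
})
K = pd.DataFrame({
    'id': ['id1', 'id2', 'id2'],
    'k1': ['a', 'c', 'x'],
    'k2': [1, np.nan, 2],
})

# Assumption: each M row is compared with every K row that has the same id
pairs = M.merge(K, on='id', how='inner')

m_cols = [c for c in M.columns if c != 'id']
k_cols = [c for c in K.columns if c != 'id']

counts = pd.DataFrame(0, index=k_cols, columns=m_cols)
for kc in k_cols:
    for mc in m_cols:
        # Cast to object so mixed string/number columns compare element by element
        a = pairs[kc].astype(object)
        b = pairs[mc].astype(object)
        # Ignore any pair where either side is missing
        both = a.notna() & b.notna()
        counts.loc[kc, mc] = int((a[both] == b[both]).sum())

print(pairs)
print(counts)
```

The double loop is quadratic in the number of columns, which is fine for a handful of columns. For wide frames, the broadcasting approach applied to the merged rows would likely be faster.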