
Python Pandas: Compare values of two dataframes with different column names by ID or row

I have two dataframes (let's name them M and K) that come from different sources. They have different column names, and the only column that is the same in both dataframes is the ID column (M[id] == K[id]).

Both dataframes have the same number of rows, but a different number of columns.

The goal is to create a matrix that shows, for each pair of an M column and a K column, how many values match for the same ID (or row). The size of the matrix (MK) is M.columns X K.columns. Each cell stores the count of matched values for the pair of M.column and K.column. The maximum number in a cell is the row count of M or K, as they are the same. Missing values (NaN) should be ignored.
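To make the meaning of a single MK cell concrete, here is a minimal sketch that counts matches for one pair of columns while skipping missing values. The helper name count_matches and the two short Series are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd


def count_matches(a: pd.Series, b: pd.Series) -> int:
    """Count rows where a and b hold equal, non-missing values.

    Hypothetical helper: assumes both Series are aligned row by row
    (same index), e.g. after sorting both frames by id.
    """
    both_present = a.notna() & b.notna()   # ignore rows where either side is NaN
    return int(((a == b) & both_present).sum())


# One column from each frame, already aligned by id
m2 = pd.Series([1, 2, 3, 4, np.nan, 1])
k2 = pd.Series([1, 2, 32, 5, np.nan, 1])

cell = count_matches(m2, k2)
print(cell)
```

The explicit notna mask means the result does not depend on how NaN == NaN happens to be evaluated for a given dtype.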

Let's look at some concrete data =)

import numpy as np
import pandas as pd

data_M = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'm1': ['a', 'b', 'c', 'd', 'e', 2],
        'm2': [1, 2, 3, 4, np.nan, 1],
        'm3': ['aa','b','cc','d','ff', 3],
        'm4': [4, 6, 3, 4, np.nan, 2],
        'm5': ['b', 6, 'a', 4, np.nan, 1],
        }
data_K = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'k1': ['z', 'bb', 'c', 'd', 'e', 4],
        'k2': [1, 2, 32, 5, np.nan, 1],
        'k3': ['aa','b','cc','d','ff', 1],
        'k4': [4, 2, 2, 4, np.nan, 4],
        'k5': [4, 1, 'as', 4, np.nan, 2],
        'k6': ['aa', 1, 'a', 3, np.nan, 2],
        }
M = pd.DataFrame(data_M, columns = ['id','m1','m2','m3','m4','m5'])
K = pd.DataFrame(data_K, columns = ['id','k1','k2','k3','k4', 'k5','k6'])

M and K output:

M
Out[2]: 
    id m1   m2  m3   m4   m5
0  id1  a  1.0  aa  4.0    b
1  id2  b  2.0   b  6.0    6
2  id3  c  3.0  cc  3.0    a
3  id4  d  4.0   d  4.0    4
4  id5  e  NaN  ff  NaN  NaN
5  id6  2  1.0   3  2.0    1

K
Out[3]: 
    id  k1    k2  k3   k4   k5   k6
0  id1   z   1.0  aa  4.0    4   aa
1  id2  bb   2.0   b  2.0    1    1
2  id3   c  32.0  cc  2.0   as    a
3  id4   d   5.0   d  4.0    4    3
4  id5   e   NaN  ff  NaN  NaN  NaN
5  id6   4   1.0   1  4.0    2    2

After the first comparison (id == 'id1') the MK matrix should look something like this:

    id  m1  m2  m3  m4  m5
id  1   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   1   0   0   0
k3  0   0   0   1   0   0
k4  0   0   0   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

After the second one (id == 'id2') it should look like this:

    id  m1  m2  m3  m4  m5
id  2   0   0   0   0   0
k1  0   0   0   0   0   0
k2  0   0   2   0   0   0
k3  0   1   0   2   0   0
k4  0   0   1   0   1   0
k5  0   0   0   0   1   0
k6  0   0   0   1   0   0

At the very end, each cell will be transformed to the percentage of matched values.
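A minimal sketch of that final step, assuming the percentage is taken relative to the total number of rows (the question does not say whether rows with missing values should be excluded from the denominator). The small counts table covers only the m1/m2 versus k1/k2 pairs and is there purely for illustration.

```python
import pandas as pd

# Match counts for a 2 x 2 corner of the MK matrix
# (rows: K columns, columns: M columns)
counts = pd.DataFrame({'m1': [3, 0], 'm2': [0, 3]}, index=['k1', 'k2'])

n_rows = 6  # number of ids compared; assumption: this is the denominator

# Convert counts to the percentage of rows that matched
percent = counts / n_rows * 100
print(percent)
```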

One last thing. Theoretically, there could be more than one row for each ID. However, that is not the case for the current problem. But if you feel inspired, you are welcome to solve the 'general case' ^_^

Many thanks.

Approach using numpy broadcasting and pd.Panel (note that pd.Panel was removed in pandas 0.25, so this needs an older pandas)

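# drop the id column and keep the raw values
m = M.values[:, 1:]
k = K.values[:, 1:]

# broadcast to shape (n_ids, n_k_columns, n_m_columns):
# one K-by-M comparison matrix per id
p = pd.Panel(
    (m[:, None] == k[:, :, None]).astype(np.uint8),
    M.id.values, K.columns[1:], M.columns[1:])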

Then access the comparison matrix for each id:

p['id1']

    m1  m2  m3  m4  m5
k1   0   0   0   0   0
k2   0   1   0   0   0
k3   0   0   1   0   0
k4   0   0   0   1   0
k5   0   0   0   1   0
k6   0   0   1   0   0
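The same broadcasting idea can be sketched without pd.Panel by keeping the 3-D result as a plain NumPy array and summing over the id axis to get the MK counts. This is a hedged variant, not the answer's original code: the explicit notna mask and the names mv, kv, eq, valid, counts and percent are assumptions added here so that missing values are ignored, as the question asks.

```python
import numpy as np
import pandas as pd

M = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'm1': ['a', 'b', 'c', 'd', 'e', 2],
    'm2': [1, 2, 3, 4, np.nan, 1],
    'm3': ['aa', 'b', 'cc', 'd', 'ff', 3],
    'm4': [4, 6, 3, 4, np.nan, 2],
    'm5': ['b', 6, 'a', 4, np.nan, 1],
})
K = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
    'k1': ['z', 'bb', 'c', 'd', 'e', 4],
    'k2': [1, 2, 32, 5, np.nan, 1],
    'k3': ['aa', 'b', 'cc', 'd', 'ff', 1],
    'k4': [4, 2, 2, 4, np.nan, 4],
    'k5': [4, 1, 'as', 4, np.nan, 2],
    'k6': ['aa', 1, 'a', 3, np.nan, 2],
})

m = M.set_index('id')
k = K.set_index('id').reindex(m.index)   # align K rows to M's id order

mv = m.to_numpy(dtype=object)            # shape (n_ids, n_m)
kv = k.to_numpy(dtype=object)            # shape (n_ids, n_k)

# eq[i, a, b] is True when K column a equals M column b for the i-th id
eq = kv[:, :, None] == mv[:, None, :]
# valid[i, a, b] is True only when neither value is missing
valid = k.notna().to_numpy()[:, :, None] & m.notna().to_numpy()[:, None, :]

# Sum over ids: one count per (K column, M column) pair
counts = pd.DataFrame((eq & valid).sum(axis=0),
                      index=k.columns, columns=m.columns)
# Assumption: percentages are relative to the total number of ids
percent = counts / len(m) * 100
print(counts)
```

Here counts has K columns as rows and M columns as columns, matching the layout of p['id1'] above; the id column is left out, as in the answer.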

Or using pandas groupby:

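# put the M and K columns side by side, aligned on id
df = M.set_index('id').join(K.set_index('id'))

def row_comp(r):
    # split the row(s) of one id back into M and K columns
    m = r.filter(like='m')
    k = r.filter(like='k')
    # compare every K column (rows) with every M column (columns)
    return pd.DataFrame(
        (m.values == k.values.T).astype(np.uint8),
        k.columns, m.columns
    )


df.groupby(level=0).apply(row_comp)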

        m1  m2  m3  m4  m5
id                        
id1 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   1   0
    k5   0   0   0   1   0
    k6   0   0   1   0   0
id2 k1   0   0   0   0   0
    k2   0   1   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   0
id3 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   0
    k6   0   0   0   0   1
id4 k1   1   0   1   0   0
    k2   0   0   0   0   0
    k3   1   0   1   0   0
    k4   0   1   0   1   1
    k5   0   1   0   1   1
    k6   0   0   0   0   0
id5 k1   1   0   0   0   0
    k2   0   0   0   0   0
    k3   0   0   1   0   0
    k4   0   0   0   0   0
    k5   0   0   0   0   1
    k6   0   0   0   0   1
id6 k1   0   0   0   0   0
    k2   0   1   0   0   1
    k3   0   1   0   0   1
    k4   0   0   0   0   0
    k5   1   0   0   1   0
    k6   1   0   0   1   0
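The per-id table can be collapsed into the MK matrix by summing over ids. Below is a hedged sketch on a smaller, made-up dataset. It uses a NaN-aware variant of row_comp, which is an assumption added here and not the function from the answer, because the question asks for missing values to be ignored.

```python
import numpy as np
import pandas as pd

# Small illustrative frames (not the data from the question)
M = pd.DataFrame({'id': ['id1', 'id2', 'id3'],
                  'm1': ['a', 'b', np.nan],
                  'm2': [1, 2, 3]})
K = pd.DataFrame({'id': ['id1', 'id2', 'id3'],
                  'k1': ['a', 'x', np.nan],
                  'k2': [1, 2, 5]})

df = M.set_index('id').join(K.set_index('id'))


def row_comp(r):
    """Compare every K column with every M column for one id, skipping NaN."""
    m = r.filter(like='m')
    k = r.filter(like='k')
    eq = m.to_numpy(dtype=object) == k.to_numpy(dtype=object).T
    valid = m.notna().to_numpy() & k.notna().to_numpy().T
    return pd.DataFrame((eq & valid).astype(np.uint8), k.columns, m.columns)


per_id = df.groupby(level=0).apply(row_comp)

# Level 1 of the index holds the K column names; summing over it
# adds up the matches from all ids
counts = per_id.groupby(level=1).sum()
# Assumption: percentages are relative to the total number of ids
percent = counts / len(df) * 100
print(counts)
```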


