Comparing the columns of two data frames?

I have two data frames as follows:

  df1:
            id,   f1,   f2,..., f800
            0,     5,  5.6,..,  3,7
            1,   2.4,  1.6,..,  1,7
            2,     3,  2.3,..,  4,4
            ....
            n,   4.7,  9,3,..., 8,2

 df2:
            id,   v1,   v2,..., v200
            0,     5,  5.6,..,  5,7
            1,   2.4,  1.6,..,  6,7
            2,     3,  2.3,..,  4,2
            ....
            n,   4.7,  9,3,..., 3,1

df1 has 800 features and df2 has only 200. The second data frame (df2) is a subset of the first (df1). I want to find the positions of the columns in df1 that correspond to the columns of df2. The match should be based on the column values, not the column names. For the example above, the desired output is either "f1 and f2" or the column positions [0, 1] in df1.
Any ideas on how to handle this?

I would concatenate the two data frames with an inner join, so that only index values present in both are kept:

result = pd.concat([df1, df2], axis=1, join='inner')
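To illustrate what the inner join does here, the following is a minimal sketch on made-up data. The tiny frames and their index values are assumptions for illustration only; it also assumes that `id` is the index of each frame rather than an ordinary column (otherwise something like `set_index("id")` would be needed first).

```python
import pandas as pd

# Toy frames whose indexes only partly overlap (made up for illustration).
df1 = pd.DataFrame({"f1": [5.0, 2.4, 3.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({"v1": [2.4, 3.0, 4.7]}, index=[1, 2, 3])

# axis=1 places the columns side by side; join="inner" keeps only
# the index labels that appear in both frames.
result = pd.concat([df1, df2], axis=1, join="inner")
print(result)
```

Only the rows labelled 1 and 2 survive, with the df1 columns on the left and the df2 columns on the right.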

Then you can use this code:

import pandas as pd

def getDuplicateColumns(df):
    duplicateColumnNames = set()

    # Columns before the last 200 come from df1
    for x in range(df.shape[1]-200):
        col = df.iloc[:, x]

        # The last 200 columns come from df2
        for y in range(df.shape[1]-200, df.shape[1]):
            otherCol = df.iloc[:, y]
            # If the two columns hold equal values, record the df2 column name
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
                # Here you could record both names to map df1 columns to df2 columns
    return list(duplicateColumnNames)

cols = getDuplicateColumns(result)

You can then do whatever you need with the returned columns, e.g. drop the redundant ones. 200 is the expected number of columns in your second data frame; you could pass it as a parameter instead. If you are sure each column in df1 has at most one match in df2, you can also break out of the inner loop after finding a match.
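As a rough sketch of a variant that avoids the hard-coded column count and reports df1 positions, the function below takes the two frames separately and maps each df2 column name to the matching df1 columns. The name `match_columns_by_value` and the toy data are hypothetical, not part of the answer above. It assumes exact equality via `Series.equals`, so float columns that differ by rounding will not match.

```python
import pandas as pd


def match_columns_by_value(df1, df2):
    """Map each df2 column name to a list of (position, name) pairs in df1."""
    # Compare only rows whose index labels appear in both frames.
    common_index = df1.index.intersection(df2.index)
    left = df1.loc[common_index]
    right = df2.loc[common_index]

    matches = {}
    for j, right_name in enumerate(right.columns):
        right_col = right.iloc[:, j]
        for pos, left_name in enumerate(left.columns):
            # Series.equals compares values and dtype, not column names.
            if left.iloc[:, pos].equals(right_col):
                matches.setdefault(right_name, []).append((pos, left_name))
    return matches


# Toy data (made up for illustration): v1 and v2 copy f1 and f2.
df1 = pd.DataFrame({"f1": [5.0, 2.4, 3.0], "f2": [5.6, 1.6, 2.3], "f3": [3.7, 1.7, 4.4]})
df2 = pd.DataFrame({"v1": [5.0, 2.4, 3.0], "v2": [5.6, 1.6, 2.3], "v3": [5.7, 6.7, 4.2]})

matches = match_columns_by_value(df1, df2)
print(matches)
```

The nested loop performs one comparison per pair of columns (800 x 200 in the question), which should be manageable, though a hashing-based approach might scale better for much wider frames.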

You need to break this problem into two parts. The first is finding the features (column names) that the two data frames have in common:

import pandas as pd

df1 = pd.DataFrame([[0,1,2,11],[3,4,5,12],[6,7,8,13]], columns=['A','B','C','D'])
df2 = pd.DataFrame([[1,2,11],[4,5,12],[7,8,14]], columns=['a','b','D'])
common = set(df1.columns) & set(df2.columns)

The second is checking whether these columns are equal:

# Recent pandas versions reject a set as an indexer, so convert it to a list
cols = list(common)
if df1[cols].equals(df2[cols]):
    print(df1[cols])
else:
    print("Nothing common")

To check multiple columns one at a time, wrap the if condition in a loop.
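One possible way to write that loop is sketched below, reusing the example frames from this answer. The variable name `equal_cols` is an assumption for illustration. Note that this compares columns that share a name, so it does not cover the case in the question where matching columns are named differently.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 11], [3, 4, 5, 12], [6, 7, 8, 13]], columns=["A", "B", "C", "D"])
df2 = pd.DataFrame([[1, 2, 11], [4, 5, 12], [7, 8, 14]], columns=["a", "b", "D"])

# Shared column names, sorted so the result order is deterministic.
common = sorted(set(df1.columns) & set(df2.columns))

# Keep only the shared names whose values are equal in both frames.
equal_cols = [name for name in common if df1[name].equals(df2[name])]

print(common)
print(equal_cols)
```

With this data the only shared name is `D`, and its last value differs (13 versus 14), so `equal_cols` ends up empty.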

The common columns:

common = set(df1.columns) & set(df2.columns)

To get the df1 columns that also exist in df2:

df1[list(common)]
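Since the question asks for column positions, a small sketch of turning the shared names into positions in df1 may help. It uses `Index.get_indexer`; the toy frames and the variable name `positions` are assumptions for illustration, and this still matches by name only.

```python
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2, 11]], columns=["A", "B", "C", "D"])
df2 = pd.DataFrame([[1, 2, 11]], columns=["a", "b", "D"])

# Shared names, kept in df1's column order.
common = [c for c in df1.columns if c in df2.columns]

# Integer positions of those names within df1.columns.
positions = df1.columns.get_indexer(common)

print(common)
print(positions.tolist())
```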
