简体   繁体   中英

Comparing the columns of two data frames?

I have two data frames as follows:

  df1:
            id,   f1,   f2,..., f800
            0,     5,  5.6,..,  3,7
            1,   2.4,  1.6,..,  1,7
            2,     3,  2.3,..,  4,4
            ....
            n,   4.7,  9,3,..., 8,2

 df2:
            id,   v1,   v2,..., v200
            0,     5,  5.6,..,  5,7
            1,   2.4,  1.6,..,  6,7
            2,     3,  2.3,..,  4,2
            ....
            n,   4.7,  9,3,..., 3,1

The df1 consists of 800 features and df2 includes only 200 features. The second data frame (df2) is a part of the first data frame (df1). Now, I want to find the position of the columns (in df1) which includes df2 columns/variables. Here the values of the columns should be similar, not the name of the columns. Taking into account above example, my desired output should be either "f1 and f2" or columns [0, 1] from df1.
Any idea to handle the problem?

I would concat the two dataframes so I am sure only same indexes are present

result = pd.concat([df1, df2], axis=1, join='inner')

then you can use this code:

import pandas as pd 
  
def getDuplicateColumns(df): 
    duplicateColumnNames = set() 
    
    for x in range(df.shape[1]-200): 
        col = df.iloc[:, x] 
          
        for y in range(df.shape[1]-200, df.shape[1]):  
            otherCol = df.iloc[:, y] 
            #if the columns are equal mark it down  
            if col.equals(otherCol): 
                duplicateColumnNames.add(df.columns.values[y]) 
                #here you can mark down both names, so you map them
    return list(duplicateColumnNames) 

cols = getDuplicateColumns(result)

and then you can do whatever you need with the selected columns returned, ie drop the redundant cols. 200 is the expected number of cols in your second df, you can instead send this as param. If you are sure each col in df1 has only 1 match in df2 you can as well break the inner loop after finding a match.

you need to break down this problem into part one is finding common features

df1 = pd.DataFrame([[0,1,2,11],[3,4,5,12],[6,7,8,13]], columns=['A','B','C','D'])
df2 = pd.DataFrame([[1,2,11],[4,5,12],[7,8,14]], columns=['a','b','D']) 
common = set(df1.columns) & set(df2.columns)

and another is checking weather this two columns are similar or not

if(df1[common].equals(df2[common])): 
     print(df1[common])
else:
     print("Nothing common")

For checking multiple columns you can create a loop on the top of if condition.

The common columns:

common = set(df1.columns) & set(df2.columns)

To get df1 columns that exist in df2:

df1[common]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM