How to find out which columns are duplicated?

Question

I have a dataframe where I suspect some of the columns may be duplicates of each other. How can I find out which ones they are? For example:

Name    val1  val2  val3  val4
dog        4     0     2     4
fish       0     0     8     0
falcon     2     2    10     2
falcon     2     2    10     2
fish       0     0     8     0
dog        4     0     2     4
fish       0     0     8     0
dog        4     0     2     4

I would like to know that val4 and val1 are the same as each other. My dataframe has about 30 columns. In the end I just want to keep one copy of each duplicate, but I do want to know which columns are being dropped.

Answer 1

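duplicated compares rows, so you can call it on the transpose, where each original column becomes a row. With keep=False, every member of a duplicate set is flagged: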

df.T.duplicated(keep=False)

Output:

Name    False
val1     True
val2    False
val3    False
val4     True
dtype: bool

You can then drop the duplicated columns with loc. With the default keep='first', the first copy of each column is kept:

df.loc[:,~df.T.duplicated()]

Output:

     Name  val1  val2  val3
0     dog     4     0     2
1    fish     0     0     8
2  falcon     2     2    10
3  falcon     2     2    10
4    fish     0     0     8
5     dog     4     0     2
6    fish     0     0     8
7     dog     4     0     2
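
Since the question also asks which columns are being dropped, one option is to keep the mask in a variable and read the names off it before filtering. This is only a minimal sketch that rebuilds the example frame from the question; the names dup_mask, dropped and deduped are illustrative, not part of pandas:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# True for the second and later copies of a column (default keep='first')
dup_mask = df.T.duplicated()

# Names of the columns that are about to be dropped
dropped = df.columns[dup_mask.to_numpy()].tolist()

# Frame with one copy of each duplicated column
deduped = df.loc[:, ~dup_mask]

print(dropped)  # ['val4']
```

Note that keep=False would flag both val1 and val4, which is handy for inspection but would remove every copy if used for filtering.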

Update: To identify and group duplicated columns, we can do a groupby on the transpose. Columns that share a group number are duplicates of each other:

df.T.groupby(list(df.index)).ngroup().sort_values()

Output:

val2    0
val3    1
val1    2
val4    2
Name    3
dtype: int64
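
If what you want is a direct mapping from each kept column to the copies it makes redundant, the same grouping idea can be sketched with a plain dict keyed on each column's values. This is a different technique from the groupby above, and the names groups and duplicates are illustrative. It assumes the columns contain no NaN values, because NaN does not compare equal to itself:

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# Collect column names under a hashable key built from the column's values
groups = {}
for col in df.columns:
    key = tuple(df[col].tolist())
    groups.setdefault(key, []).append(col)

# Map the first column of each duplicate set to its later copies
duplicates = {cols[0]: cols[1:] for cols in groups.values() if len(cols) > 1}

print(duplicates)  # {'val1': ['val4']}
```

The values of duplicates are the columns that would be dropped, and the keys are the copies that stay.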

solution1
3 ACCEPTED 2020-06-25 15:50:32