
How to find out which columns are duplicated?

I have a dataframe where I suspect some of the columns may be duplicates of each other. How can I find out which ones they are? For example:

Name      val1   val2   val3   val4
dog          4      0      2      4
fish         0      0      8      0
falcon       2      2     10      2
falcon       2      2     10      2
fish         0      0      8      0
dog          4      0      2      4
fish         0      0      8      0
dog          4      0      2      4

I would like to know that val4 and val1 are identical. My dataframe has about 30 columns. In the end I want to keep only one copy of each duplicated column, but I also want to know which columns are being dropped.

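You can call duplicated on the transposed dataframe. Transposing turns columns into rows, and keep=False marks every copy of a duplicated column: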

df.T.duplicated(keep=False)

Output:

Name    False
val1     True
val2    False
val3    False
val4     True
dtype: bool
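For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example dataframe from the question and pulls the flagged column names out of the boolean result. The variable names mask and duplicate_columns are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# Transposing turns columns into rows, so the row-wise duplicated()
# ends up comparing the original columns with each other.
# keep=False flags every member of a duplicate set, not only the later copies.
mask = df.T.duplicated(keep=False)

# Keep only the labels where the mask is True.
duplicate_columns = mask[mask].index.tolist()
print(duplicate_columns)  # ['val1', 'val4']
```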

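You can then drop the duplicated columns with loc. With the default keep='first', the first copy of each column is kept and only the later copies are dropped: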

df.loc[:,~df.T.duplicated()]

Output:

     Name  val1  val2  val3
0     dog     4     0     2
1    fish     0     0     8
2  falcon     2     2    10
3  falcon     2     2    10
4    fish     0     0     8
5     dog     4     0     2
6    fish     0     0     8
7     dog     4     0     2
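Since the question also asks which columns are being dropped, the same mask can be read the other way round. This is a small sketch on the example data; the names dropped and deduped are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# With the default keep='first', only the second and later copies are True,
# so the True entries are exactly the columns that will be removed.
is_later_copy = df.T.duplicated()
dropped = is_later_copy[is_later_copy].index.tolist()

# Keep every column that is not a later copy of an earlier one.
deduped = df.loc[:, ~is_later_copy]

print(dropped)                   # ['val4']
print(deduped.columns.tolist())  # ['Name', 'val1', 'val2', 'val3']
```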

Update: To identify and group the duplicated columns, we can do a groupby on the transposed dataframe. Columns that share a group number are duplicates of each other:

df.T.groupby(list(df.index)).ngroup().sort_values()

Output:

val2    0
val3    1
val1    2
val4    2
Name    3
dtype: int64
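If a mapping from each kept column to its duplicates is more convenient than group numbers, one alternative is to compare columns pairwise with Series.equals instead of transposing. This is a different technique from the groupby above, not a rewrite of it. The helper name group_duplicate_columns is hypothetical, and the sketch assumes the column labels are unique. Note that Series.equals also requires matching dtypes, so an int column and a float column holding the same numbers would not be grouped.

```python
import pandas as pd


def group_duplicate_columns(frame):
    """Map each first-seen column to the later columns whose values equal it.

    Hypothetical helper for illustration; assumes unique column labels.
    """
    groups = {}
    for col in frame.columns:
        for kept in groups:
            # Series.equals compares values (and dtype), ignoring the column name.
            if frame[col].equals(frame[kept]):
                groups[kept].append(col)
                break
        else:
            # No earlier column matched, so this one starts a new group.
            groups[col] = []
    # Report only the columns that actually have duplicates.
    return {kept: dups for kept, dups in groups.items() if dups}


df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

duplicate_groups = group_duplicate_columns(df)
print(duplicate_groups)  # {'val1': ['val4']}
```

With about 30 columns the pairwise loop is cheap, and it avoids building an object-dtype transposed copy of the whole dataframe.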


 