I have a dataframe where I suspect some of the columns may be duplicates of each other. How can I find out which ones they are? For example:
Name val1 val2 val3 val4
dog 4 0 2 4
fish 0 0 8 0
falcon 2 2 10 2
falcon 2 2 10 2
fish 0 0 8 0
dog 4 0 2 4
fish 0 0 8 0
dog 4 0 2 4
I would like to know that val4 and val1 are identical. My dataframe has about 30 columns. In the end I want to keep just one copy of each duplicated column, but I also want to know which columns are being dropped.
You can transpose the frame and call duplicated, which then compares columns instead of rows:
df.T.duplicated(keep=False)
Output:
Name False
val1 True
val2 False
val3 False
val4 True
dtype: bool
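To try this out, the example frame can be rebuilt as in the sketch below. The construction is an assumption based on the table in the question, and the names mask and flagged are only illustrative. Filtering the boolean result by itself leaves just the names of the flagged columns.

```python
import pandas as pd

# Example frame reconstructed from the table in the question (assumed construction).
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# keep=False marks every member of a duplicate set, not only the later copies.
mask = df.T.duplicated(keep=False)

# The mask is indexed by column name, so filtering it by itself yields the names.
flagged = mask[mask].index.tolist()
print(flagged)
```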
You can then drop the repeated columns with loc:
df.loc[:,~df.T.duplicated()]
Output:
Name val1 val2 val3
0 dog 4 0 2
1 fish 0 0 8
2 falcon 2 2 10
3 falcon 2 2 10
4 fish 0 0 8
5 dog 4 0 2
6 fish 0 0 8
7 dog 4 0 2
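Since the question also asks which columns are being dropped, the same mask can report them before it is used for selection. This is a small sketch on the example data; the names is_repeat, dropped and deduped are made up for illustration.

```python
import pandas as pd

# Example frame reconstructed from the table in the question (assumed construction).
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})

# With the default keep='first', only the second and later copies are marked True.
is_repeat = df.T.duplicated()

# Names of the columns that are about to be removed.
dropped = is_repeat[is_repeat].index.tolist()

# Keep the first copy of every column.
deduped = df.loc[:, ~is_repeat]

print("dropped:", dropped)
print("kept:", deduped.columns.tolist())
```

On the example data this reports val4 as the dropped column, because val1 appears first and is the copy that is kept.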
Update: To identify and group the duplicated columns, we can use a groupby on the transposed frame; columns that share a group number are identical:
df.T.groupby(list(df.index)).ngroup().sort_values()
Output:
val2 0
val3 1
val1 2
val4 2
Name 3
dtype: int64
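If a direct mapping from each kept column to its duplicates is more convenient than group numbers, one possible alternative is a plain loop over the columns. This is a sketch and not part of the answer above: the helper duplicate_column_groups is hypothetical, it assumes unique column names, and it compares values with ordinary Python equality, so missing values (NaN) may not be treated as equal to each other.

```python
import pandas as pd

# Example frame reconstructed from the table in the question (assumed construction).
df = pd.DataFrame({
    "Name": ["dog", "fish", "falcon", "falcon", "fish", "dog", "fish", "dog"],
    "val1": [4, 0, 2, 2, 0, 4, 0, 4],
    "val2": [0, 0, 2, 2, 0, 0, 0, 0],
    "val3": [2, 8, 10, 10, 8, 2, 8, 2],
    "val4": [4, 0, 2, 2, 0, 4, 0, 4],
})


def duplicate_column_groups(frame):
    """Map each first-seen column to the later columns holding identical values.

    Hypothetical helper; assumes column names are unique.
    """
    first_seen = {}  # tuple of column values -> name of the first column with them
    groups = {}      # kept column name -> list of its duplicate column names
    for col in frame.columns:
        key = tuple(frame[col].tolist())
        if key in first_seen:
            groups.setdefault(first_seen[key], []).append(col)
        else:
            first_seen[key] = col
    return groups


groups = duplicate_column_groups(df)
print(groups)

# Drop every duplicate while keeping the first copy of each column.
to_drop = [name for dups in groups.values() for name in dups]
deduped = df.drop(columns=to_drop)
```

The dictionary keys are the columns that survive and the values are the columns removed in their favour, which gives a record of what was dropped and why.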