
Remove repeating data in pandas

I have 4 columns of data: A, B, C, D. Some rows repeat with the values in A and C swapped. For example, row 1 (P2 XX P6 XX) reappears as row 5 (P6 XX P2 XX). Can anyone help me remove these repeated rows from a pandas DataFrame?

A   B   C   D
P2  XX  P6  XX
P3  XX  P5  XX
P5  XX  P8  XX
P5  XX  P3  XX
P6  XX  P2  XX
P8  XX  P5  XX
P1  LU  P2  LU
P2  LU  P1  LU
P3  LU  P9  LU
P3  LU  P6  LU
P6  LU  P3  LU
P9  LU  P3  LU

Output:

A  B  C  D 
P2 XX P6 XX 
P3 XX P5 XX 
P5 XX P8 XX 
P1 LU P2 LU 
P3 LU P9 LU 
P3 LU P6 LU

Assuming it's okay to swap values between columns A and C, you can use np.minimum and np.maximum to put the smaller label in A and the larger one in C, so that swapped pairs become identical rows, and then drop duplicates:

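import numpy as np

# Smaller label goes to A, larger to C; the right-hand side is evaluated before assignment
df["A"], df["C"] = np.minimum(df["A"], df["C"]), np.maximum(df["A"], df["C"])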

df.drop_duplicates()
    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
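
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same min/max idea. It rebuilds the sample frame from the question and works on a copy (the name `swapped` is just an illustrative choice) so the original `df` is left untouched. Converting to object arrays first is an assumption made so the element-wise comparison does not depend on which string dtype pandas picks.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
data = """P2 XX P6 XX
P3 XX P5 XX
P5 XX P8 XX
P5 XX P3 XX
P6 XX P2 XX
P8 XX P5 XX
P1 LU P2 LU
P2 LU P1 LU
P3 LU P9 LU
P3 LU P6 LU
P6 LU P3 LU
P9 LU P3 LU"""
df = pd.DataFrame([line.split() for line in data.splitlines()],
                  columns=list("ABCD"))

# Plain object arrays so np.minimum/np.maximum compare the Python strings.
a = df["A"].to_numpy(dtype=object)
c = df["C"].to_numpy(dtype=object)

# Normalise each pair on a copy: smaller label in A, larger in C.
swapped = df.copy()
swapped["A"] = np.minimum(a, c)
swapped["C"] = np.maximum(a, c)

# Rows that were mirror images of each other are now exact duplicates.
result = swapped.drop_duplicates()
print(result)
```

Note that the comparison is lexicographic, so a label such as P10 would sort before P2. That does not matter for deduplication, only for which value ends up in which column.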

We can use np.sort on axis=1 to sort the values within each row, then call drop_duplicates on the sorted frame. Lastly, use the surviving index to filter df. This keeps the original, unswapped rows, but it relies on df having the default 0..n-1 index, since the sorted frame is built with a fresh one:

import numpy as np
import pandas as pd

idx = (
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
)

df = df.loc[idx]

Or without a second variable:

df = df.loc[
    pd.DataFrame(
        np.sort(df.values, axis=1), columns=df.columns
    ).drop_duplicates().index
]

df:

    A   B   C   D
0  P2  XX  P6  XX
1  P3  XX  P5  XX
2  P5  XX  P8  XX
6  P1  LU  P2  LU
8  P3  LU  P9  LU
9  P3  LU  P6  LU
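
A possible variation on the row-sorting idea, sketched under a couple of assumptions: only the pair columns (A and C here) are sorted, so values in B and D are never mixed with them, and a boolean mask is used instead of index labels, so it should also work when df does not have the default index. The helper name `drop_swapped_pairs` and its `pair_cols` parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
data = """P2 XX P6 XX
P3 XX P5 XX
P5 XX P8 XX
P5 XX P3 XX
P6 XX P2 XX
P8 XX P5 XX
P1 LU P2 LU
P2 LU P1 LU
P3 LU P9 LU
P3 LU P6 LU
P6 LU P3 LU
P9 LU P3 LU"""
df = pd.DataFrame([line.split() for line in data.splitlines()],
                  columns=list("ABCD"))


def drop_swapped_pairs(frame, pair_cols=("A", "C")):
    """Hypothetical helper: drop rows that repeat an earlier row
    once the values in pair_cols are put in sorted order."""
    cols = list(pair_cols)
    keys = frame.copy()
    # Sort only the pair columns within each row; other columns stay as they are.
    keys[cols] = np.sort(frame[cols].to_numpy(dtype=object), axis=1)
    # Positional boolean mask, so the frame's index labels do not matter.
    keep = ~keys.duplicated().to_numpy()
    return frame[keep]


result = drop_swapped_pairs(df)
print(result)
```

Like the index-based version, this returns the first occurrence of each pair with its original column order intact.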
