Pandas remove reversed duplicates across two columns

An example DataFrame:

import pandas as pd

df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [  2,   8,   1,   8,   7,   3,   1,   3,   2]})

    node_a  node_b  value
0   X       X       2
1   X       Y       8
2   X       Z       1
3   Y       X       8
4   Y       Y       7
5   Y       Z       3
6   Z       X       1
7   Z       Y       3
8   Z       Z       2

I need to remove reversed duplicates, e.g., keep node_a = 'X', node_b = 'Y' but remove node_a = 'Y', node_b = 'X'.

Desired output:

    node_a  node_b  value
0   X       X       2
1   X       Y       8
2   X       Z       1
4   Y       Y       7
5   Y       Z       3
8   Z       Z       2

Please note that I need a general solution, not one specific to this particular data.

Use np.sort along axis=1 to sort node_a and node_b within each row, assign the sorted values to two temporary columns, then call drop_duplicates on those columns and drop them afterwards:

import numpy as np

df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])

Result:

print(out)
  node_a node_b  value
0      X      X      2
1      X      Y      8
2      X      Z      1
4      Y      Y      7
5      Y      Z      3
8      Z      Z      2
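
The same idea can be wrapped in a helper that leaves the original frame untouched instead of adding temporary columns to it. This is only a minimal sketch: the function name drop_reversed_duplicates is hypothetical (it is not part of pandas), and it assumes the values in the two columns can be compared with each other so that np.sort can order them.

```python
import numpy as np
import pandas as pd


def drop_reversed_duplicates(df, col_a, col_b):
    """Drop rows whose (col_a, col_b) pair already appeared, in either order.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Sort each pair within its row so (Y, X) and (X, Y) become the same key.
    keys = np.sort(df[[col_a, col_b]].to_numpy(), axis=1)
    # duplicated() flags every repeat of a key after its first occurrence.
    is_repeat = pd.DataFrame(keys, index=df.index).duplicated()
    return df[~is_repeat]


df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 8, 7, 3, 1, 3, 2]})
out = drop_reversed_duplicates(df, 'node_a', 'node_b')
print(out.index.tolist())  # [0, 1, 2, 4, 5, 8]
```

Building the key frame with index=df.index keeps the boolean mask aligned with df even when its index is not the default 0..n-1.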

You could do the following:

# duplicates regardless of order (index=df.index keeps the mask aligned with df)
un_dups = pd.Series([frozenset(row) for row in df[['node_a', 'node_b']].to_numpy()], index=df.index).duplicated()

# duplicates in the same order
o_dups = df.duplicated(subset=['node_a', 'node_b'])

# xor is True only for reversed-order duplicates; keep the rows where it is False
mask = ~(un_dups ^ o_dups)

print(df[mask])

Output:

  node_a node_b  value
0      X      X      2
1      X      Y      8
2      X      Z      1
4      Y      Y      7
5      Y      Z      3
8      Z      Z      2

The idea is to create a mask that is False for every row that duplicates an earlier row in reverse order.

To better understand the approach, look at the truth values:

  node_a node_b  value  un_dups  o_dups    xor
0      X      X      2    False   False  False
1      X      Y      8    False   False  False
2      X      Z      1    False   False  False
3      Y      X      8     True   False   True
4      Y      Y      7    False   False  False
5      Y      Z      3    False   False  False
6      Z      X      1     True   False   True
7      Z      Y      3     True   False   True
8      Z      Z      2    False   False  False

As the table shows, xor (exclusive or) is true whenever its two inputs differ. A row that duplicates an earlier row in the same order is also a duplicate when order is ignored, so xor is true only for rows that duplicate an earlier row in reverse order.

Finally, note that the mask is the negation of the xor, i.e., it selects the rows that are not reversed duplicates.
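
To check the reasoning, the truth table above can be rebuilt directly from the two duplicate flags. This is a sketch on the question's sample data; the variable names pairs and table are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 8, 7, 3, 1, 3, 2]})

pairs = df[['node_a', 'node_b']].to_numpy()
# True when the unordered pair was seen before, in either order.
un_dups = pd.Series([frozenset(row) for row in pairs], index=df.index).duplicated()
# True when the ordered pair was seen before.
o_dups = df.duplicated(subset=['node_a', 'node_b'])

# Attach the flags and their xor as extra columns for inspection.
table = df.assign(un_dups=un_dups, o_dups=o_dups, xor=un_dups ^ o_dups)
print(table)
```

Only rows 3, 6 and 7 have xor set to True, which are exactly the rows missing from the output.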

Here's one way to do it: create a temporary column that holds node_a and node_b sorted within each row and joined into one string, then drop duplicates on that column, keeping the first occurrence of each pair. Note that joining the labels assumes their concatenation is unambiguous, as it is for the single-character labels here:

df['sorted'] = df.apply(lambda x: ''.join(sorted([x['node_a'], x['node_b']])), axis=1)

#   node_a node_b  value sorted
# 0      X      X      2     XX
# 1      X      Y      8     XY
# 2      X      Z      1     XZ
# 3      Y      X      8     XY
# 4      Y      Y      7     YY
# 5      Y      Z      3     YZ
# 6      Z      X      1     XZ
# 7      Z      Y      3     YZ
# 8      Z      Z      2     ZZ

df.drop_duplicates(subset='sorted').drop('sorted', axis=1)

#   node_a node_b  value
# 0      X      X      2
# 1      X      Y      8
# 2      X      Z      1
# 4      Y      Y      7
# 5      Y      Z      3
# 8      Z      Z      2
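
Joining the sorted labels into one string can conflate distinct pairs once labels are longer than one character: ('A', 'BC') and ('AB', 'C') both become 'ABC'. The sketch below is a variant, not the approach above verbatim: it keeps each sorted pair as a tuple in a separate Series (the name key is hypothetical) instead of a temporary column, and it assumes the two labels in a row can be compared with each other. The sample data is made up to show the collision.

```python
import pandas as pd

# Illustrative data with multi-character labels (not from the question).
df = pd.DataFrame({'node_a': ['A', 'AB', 'C'],
                   'node_b': ['BC', 'C', 'AB'],
                   'value':  [1, 2, 3]})

# One order-independent tuple per row; tuples keep the two labels separate.
key = pd.Series([tuple(sorted(pair)) for pair in zip(df['node_a'], df['node_b'])],
                index=df.index)

# Keep the first occurrence of each unordered pair.
out = df[~key.duplicated()]
print(out.index.tolist())  # [0, 1]
```

With string joining, all three rows would share the key 'ABC' and only row 0 would survive; with tuple keys only row 2, the true reverse of row 1, is dropped.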
