Checking for multi-value duplicates in pandas

Question

Inputs

I have a DataFrame with several columns.

proof_path = 

   #1  X  Y  #2 X_  Z  #3  W Z_  #4 W_ Y_
0  p1  a  b  p2  a  c  p2  a  c  p3  a  b
1  p1  a  b  p2  a  c  p3  a  c  p1  a  b
2  p1  a  b  p2  a  d  p3  e  d  p4  e  b
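
For anyone who wants to reproduce the example, the frame above can be rebuilt roughly as follows. This is only a sketch: the column names and values are copied from the printed table, and all values are assumed to be plain strings.

```python
import pandas as pd

# Column names and values copied from the printed frame above.
columns = ['#1', 'X', 'Y', '#2', 'X_', 'Z', '#3', 'W', 'Z_', '#4', 'W_', 'Y_']
rows = [
    ['p1', 'a', 'b', 'p2', 'a', 'c', 'p2', 'a', 'c', 'p3', 'a', 'b'],
    ['p1', 'a', 'b', 'p2', 'a', 'c', 'p3', 'a', 'c', 'p1', 'a', 'b'],
    ['p1', 'a', 'b', 'p2', 'a', 'd', 'p3', 'e', 'd', 'p4', 'e', 'b'],
]
proof_path = pd.DataFrame(rows, columns=columns)
print(proof_path)
```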

In the above DataFrame, I want to check, for each row, whether any of the triples [#1, X, Y], [#2, X_, Z], [#3, W, Z_], and [#4, W_, Y_] duplicate one another.

For example, in the row at index 0, [#2, X_, Z] and [#3, W, Z_] both equal [p2, a, c]. Likewise, in the row at index 1, [#1, X, Y] and [#4, W_, Y_] both equal [p1, a, b]. I want to drop every row in which any of these triples overlap.

My desired output is:

Output

proof_path = 

   #1  X  Y  #2 X_  Z  #3  W Z_  #4 W_ Y_
2  p1  a  b  p2  a  d  p3  e  d  p4  e  b

I tried the following.

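triple_size = 3
n_triples = len(proof_path.columns) // triple_size
for depth in range(n_triples - 1):
    for i in range(1, n_triples - depth):
        # Compare the triple at position `depth` with the triple at `depth + i`.
        current_rComp = proof_path.iloc[:, depth*triple_size:(depth+1)*triple_size].copy()
        next_rComp = proof_path.iloc[:, (depth+i)*triple_size:(depth+i+1)*triple_size].copy()
        # Give both slices the same column labels so that .ne() aligns them.
        current_rComp.columns = ['pred', 'subj', 'obj']
        next_rComp.columns = ['pred', 'subj', 'obj']
        # Keep only the rows where the two triples differ in at least one position.
        proof_path = proof_path[current_rComp.ne(next_rComp).any(axis=1)]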

Although this approach achieves the desired result, it is inefficient because it builds a new subset of proof_path on every iteration. Is there a simpler way to do this?

Answer 1

To avoid the nested loops, you could use sets: for each row, put the four triples of values in a set. The number of elements in the set is the number of unique triples. You can then use this count as a mask to select the rows in which all four triples are distinct:

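# Count the distinct (pred, subj, obj) triples in each row.
# .iloc makes the slices explicitly positional.
proof_path['n_unique_triples'] = \
    proof_path.apply(lambda row: len(set((tuple(row.iloc[0:3]),
                                          tuple(row.iloc[3:6]),
                                          tuple(row.iloc[6:9]),
                                          tuple(row.iloc[9:12])))), axis=1)

# Keep only the rows whose four triples are all different.
df_select = proof_path[proof_path.n_unique_triples == 4]
df_select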

    #1  X   Y   #2  X_  Z   #3  W   Z_  #4  W_  Y_  n_unique_triples
2   p1  a   b   p2  a   d   p3  e   d   p4  e   b   4
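
If the row-wise apply becomes slow on a large frame, one possible alternative is to compare the triples with NumPy broadcasting instead. The sketch below is not part of the original answer. The helper name `drop_rows_with_repeated_triples` and the `demo` frame are made up for illustration, and it assumes the columns are laid out as consecutive groups of `triple_size`.

```python
import numpy as np
import pandas as pd


def drop_rows_with_repeated_triples(df: pd.DataFrame, triple_size: int = 3) -> pd.DataFrame:
    """Return the rows of df in which no two column groups hold identical values.

    Hypothetical helper: assumes the columns form consecutive groups of
    `triple_size` (for example pred, subj, obj repeated four times).
    """
    values = df.to_numpy()
    n_rows, n_cols = values.shape
    if n_cols % triple_size != 0:
        raise ValueError("number of columns must be a multiple of triple_size")
    n_triples = n_cols // triple_size

    # Shape (n_rows, n_triples, triple_size): one triple per middle index.
    triples = values.reshape(n_rows, n_triples, triple_size)

    # equal[r, i, j] is True when triples i and j of row r match in every position.
    equal = (triples[:, :, None, :] == triples[:, None, :, :]).all(axis=-1)

    # Only look at pairs with i < j, so a triple is never compared with itself.
    upper = np.triu(np.ones((n_triples, n_triples), dtype=bool), k=1)
    has_repeat = (equal & upper).any(axis=(1, 2))

    return df[~has_repeat]


# Same values as the frame in the question.
demo = pd.DataFrame(
    [list('p1 a b p2 a c p2 a c p3 a b'.split()),
     list('p1 a b p2 a c p3 a c p1 a b'.split()),
     list('p1 a b p2 a d p3 e d p4 e b'.split())],
    columns=['#1', 'X', 'Y', '#2', 'X_', 'Z', '#3', 'W', 'Z_', '#4', 'W_', 'Y_'],
)
result = drop_rows_with_repeated_triples(demo)
print(result)  # only the row with index 2 remains
```

Unlike the set-based version, this does not add a helper column to the frame, but it builds an intermediate array whose size grows with the square of the number of triples per row.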

Answer 1: accepted, 2021-06-01 10:53:31
