
Check multi-value duplication in pandas

Inputs

I have a DataFrame with several columns.

proof_path = 

   #1  X  Y  #2 X_  Z  #3  W Z_  #4 W_ Y_
0  p1  a  b  p2  a  c  p2  a  c  p3  a  b
1  p1  a  b  p2  a  c  p3  a  c  p1  a  b
2  p1  a  b  p2  a  d  p3  e  d  p4  e  b

In the above DataFrame, I want to check, for each row, whether any of the triples [#1, X, Y], [#2, X_, Z], [#3, W, Z_], and [#4, W_, Y_] duplicate one another.

For example, in the row at index 0, [#2, X_, Z] and [#3, W, Z_] both hold [p2, a, c]. Likewise, in the row at index 1, [#1, X, Y] and [#4, W_, Y_] both hold [p1, a, b]. I want to drop every row in which any two of these triples coincide.
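
To make the rule concrete, here is a minimal sketch in plain Python that applies it to the three rows shown above. The helper name `has_duplicate_triple` is hypothetical (it is not a pandas function), and the sketch assumes every triple is exactly three consecutive values wide.

```python
def has_duplicate_triple(row, size=3):
    """Return True if any two consecutive `size`-length chunks of `row` are equal."""
    # Split the flat row into tuples so they can be hashed and compared as units.
    triples = [tuple(row[i:i + size]) for i in range(0, len(row), size)]
    # A set keeps only distinct triples, so a shorter set means a repeat.
    return len(set(triples)) < len(triples)


# The three rows of the example frame, in column order.
rows = [
    ["p1", "a", "b", "p2", "a", "c", "p2", "a", "c", "p3", "a", "b"],
    ["p1", "a", "b", "p2", "a", "c", "p3", "a", "c", "p1", "a", "b"],
    ["p1", "a", "b", "p2", "a", "d", "p3", "e", "d", "p4", "e", "b"],
]

flags = [has_duplicate_triple(r) for r in rows]
print(flags)  # [True, True, False]
```

Rows flagged `True` are the ones to drop, so only the row at index 2 should remain.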

My desired output is:

Output

proof_path = 

   #1  X  Y  #2 X_  Z  #3  W Z_  #4 W_ Y_
2  p1  a  b  p2  a  d  p3  e  d  p4  e  b

I tried the following:

triple_size = 3
n_triples = len(proof_path.columns) // triple_size
names = ['pred', 'subj', 'obj']
# Compare every pair of triples (depth, depth + i) and keep only rows where they differ.
for depth in range(n_triples - 1):
    for i in range(1, n_triples - depth):
        # Relabel both slices identically so that .ne() aligns them column by column.
        current_rComp = proof_path.iloc[:, depth*triple_size:(depth+1)*triple_size].set_axis(names, axis=1)
        next_rComp = proof_path.iloc[:, (depth+i)*triple_size:(depth+i+1)*triple_size].set_axis(names, axis=1)
        proof_path = proof_path[current_rComp.ne(next_rComp).any(axis=1)]

Although this produces the desired result, it is inefficient because it builds a new subset of proof_path on every iteration. Is there a simpler way to do this?

To avoid the nested loops, you could use sets: for each row, put the four triples of values in a set. The number of elements in the set is the number of unique triples. You can then compare that count with 4 to build a mask that selects the rows:

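# Count the distinct triples in each row; .iloc makes the slicing explicitly positional.
proof_path['n_unique_triples'] = \
    proof_path.apply(lambda row: len(set((tuple(row.iloc[0:3]),
                                          tuple(row.iloc[3:6]),
                                          tuple(row.iloc[6:9]),
                                          tuple(row.iloc[9:12])))), axis=1)

# Keep only the rows whose four triples are all different.
df_select = proof_path[proof_path.n_unique_triples == 4]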
df_select
    #1  X   Y   #2  X_  Z   #3  W   Z_  #4  W_  Y_  n_unique_triples
2   p1  a   b   p2  a   d   p3  e   d   p4  e   b   4
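
As an alternative to the row-wise `apply`, the same check could be sketched with NumPy broadcasting. This is only an illustration under stated assumptions: the frame holds exactly the twelve columns shown in the question, every triple is three columns wide, and the names `triples`, `same`, and `has_dup` are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
columns = ["#1", "X", "Y", "#2", "X_", "Z", "#3", "W", "Z_", "#4", "W_", "Y_"]
data = [
    ["p1", "a", "b", "p2", "a", "c", "p2", "a", "c", "p3", "a", "b"],
    ["p1", "a", "b", "p2", "a", "c", "p3", "a", "c", "p1", "a", "b"],
    ["p1", "a", "b", "p2", "a", "d", "p3", "e", "d", "p4", "e", "b"],
]
proof_path = pd.DataFrame(data, columns=columns)

triple_size = 3
# Shape (n_rows, n_triples, triple_size): one axis per row, triple, and field.
triples = proof_path.to_numpy().reshape(len(proof_path), -1, triple_size)

# same[r, i, j] is True when triple i equals triple j in row r.
same = (triples[:, :, None, :] == triples[:, None, :, :]).all(axis=-1)

# Look only at pairs with i < j, so a triple is never compared with itself.
i, j = np.triu_indices(triples.shape[1], k=1)
has_dup = same[:, i, j].any(axis=1)

# Keep the rows in which no pair of triples coincides.
result = proof_path[~has_dup]
print(result)
```

This compares all pairs of triples at once instead of calling a Python function per row, and it leaves `proof_path` without an extra helper column. Whether it is actually faster depends on the size of the data, so it is worth timing both versions on the real frame.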


 