
How to compare multi-value duplicates in Dataframe

Inputs

I have a DataFrame with several columns, and a list of column triples:

proof_path = 

   #1  X  Y  #2  Z  #3  W  #4
0  p1  a  b  p2  c  p2  a  p3
1  p1  a  b  p2  c  p3  a  p1
2  p1  a  b  p2  d  p3  e  p4

rule = [('#1', 'X', 'Y'), ('#2', 'X', 'Z'), ('#3', 'W', 'Z'), ('#4', 'W', 'Y')]
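For anyone who wants to reproduce the inputs, here is a minimal sketch that rebuilds them with pandas. The values are copied from the table above; building the frame from a dict of columns is just one convenient choice, not necessarily how the original data was created.

```python
import pandas as pd

# Sample frame from the question; column names and values copied from the table.
proof_path = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"],
        "X": ["a", "a", "a"],
        "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"],
        "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"],
        "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)

# Each tuple names one group of columns to compare within a row.
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

print(proof_path)
```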

In the above DataFrame, I want to check, for each row, whether any of the value triples selected by (#1, X, Y), (#2, X, Z), (#3, W, Z), and (#4, W, Y) are duplicated.

For example, in the row with index 0, (#2, X, Z) and (#3, W, Z) both have the value (p2, a, c).

In addition, in the row with index 1, (#1, X, Y) and (#4, W, Y) both have the value (p1, a, b). I want to drop rows in which these multi-value groups overlap from the DataFrame.
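To make the notion of overlap concrete, the check can be sketched in plain Python. This is only an illustration of the criterion, not an efficient solution. The names columns, rows, and has_duplicate are made up for this sketch; the values are copied from the sample table.

```python
# Column order and row values copied from the question's sample table.
columns = ["#1", "X", "Y", "#2", "Z", "#3", "W", "#4"]
rows = {
    0: dict(zip(columns, "p1 a b p2 c p2 a p3".split())),
    1: dict(zip(columns, "p1 a b p2 c p3 a p1".split())),
    2: dict(zip(columns, "p1 a b p2 d p3 e p4".split())),
}
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

has_duplicate = {}
for idx, row in rows.items():
    # One value triple per rule entry, e.g. ("p2", "a", "c") for ("#2", "X", "Z").
    triples = [tuple(row[col] for col in group) for group in rule]
    # A set collapses equal triples, so a shorter set means a repeat exists.
    has_duplicate[idx] = len(set(triples)) < len(triples)

print(has_duplicate)  # {0: True, 1: True, 2: False}
```

Rows 0 and 1 are flagged and row 2 is not, which matches the desired output.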

My desired output is:

Output

proof_path = 

   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4

I tried the following:

for depth in range(len(rule)-1):
    for i in range(1, len(rule)-depth):
        current_rComp = proof_path[[rule[depth][0], rule[depth][1], rule[depth][2]]]
        current_rComp.columns = ['pred', 'subj', 'obj']
        next_rComp = proof_path[[rule[i+depth][0], rule[i+depth][1], rule[i+depth][2]]]
        next_rComp.columns = ['pred', 'subj', 'obj']
        proof_path = proof_path[current_rComp.ne(next_rComp).any(axis=1)]

Although this approach produces the desired result, it is inefficient because it creates new DataFrames on every iteration. Is there a simpler way to do this?

Create a placeholder mask that initially contains only False values. At the end, the mask will be True for every row in which a duplicate was found.

Generate all length-two combinations from the rule list. For each combination, compare the corresponding slices of the DataFrame to get a boolean mask, reduce that mask with all along axis=1, and take the logical OR of the reduced mask with the placeholder mask.

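from itertools import combinations

import numpy as np

# df is the DataFrame from the question (proof_path there).
# Start with no rows flagged, then flag rows where any two groups match.
mask = np.full(len(df), False)
for x, y in combinations(rule, 2):
    mask |= (df[[*x]].values == df[[*y]].values).all(1)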

Alternatively, the same approach can be written as a list comprehension:

mask = np.any([(df[[*x]].values == df[[*y]].values).all(1) 
               for x, y in combinations(rule, 2)], axis=0)

>>> df[~mask]

   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4
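Putting the pieces together, the mask approach can be wrapped in a small helper and run end to end. This is a sketch under a few assumptions: the function name drop_rows_with_repeated_groups is made up here, the sample frame is rebuilt from the question's table, and to_numpy() is used in place of .values, which should behave the same for this data.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def drop_rows_with_repeated_groups(df, rule):
    """Drop rows in which any two column groups from `rule` hold equal values."""
    mask = np.zeros(len(df), dtype=bool)  # True marks a row to drop
    for x, y in combinations(rule, 2):
        # Compare the two groups position by position; a row matches
        # only if every position in the group is equal.
        mask |= (df[list(x)].to_numpy() == df[list(y)].to_numpy()).all(axis=1)
    return df[~mask]


# Sample data copied from the question.
df = pd.DataFrame(
    [
        ["p1", "a", "b", "p2", "c", "p2", "a", "p3"],
        ["p1", "a", "b", "p2", "c", "p3", "a", "p1"],
        ["p1", "a", "b", "p2", "d", "p3", "e", "p4"],
    ],
    columns=["#1", "X", "Y", "#2", "Z", "#3", "W", "#4"],
)
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

result = drop_rows_with_repeated_groups(df, rule)
print(result)
```

Only the row with index 2 remains. The helper builds one boolean array per pair of groups and returns a filtered frame once, so it does not re-slice the DataFrame on each iteration the way the loop in the question does.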

You can drop rows that duplicate one another on a subset of columns (note that this compares rows against other rows, not column groups within a single row), for example:

df = df.drop_duplicates(subset=['#1', 'X', 'Y'],keep=False)
df = df.drop_duplicates(subset=['#2', 'X', 'Z'],keep=False)
df = df.drop_duplicates(subset=['#3', 'W', 'Z'],keep=False)

Refer to the drop_duplicates documentation for additional parameters.
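As a quick check of how drop_duplicates behaves on the sample data from the question, here is a small sketch. The frame is rebuilt so the snippet is self-contained, and the variable name after is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ["p1", "a", "b", "p2", "c", "p2", "a", "p3"],
        ["p1", "a", "b", "p2", "c", "p3", "a", "p1"],
        ["p1", "a", "b", "p2", "d", "p3", "e", "p4"],
    ],
    columns=["#1", "X", "Y", "#2", "Z", "#3", "W", "#4"],
)

# keep=False drops every row whose subset values also appear in another row.
after = df.drop_duplicates(subset=["#1", "X", "Y"], keep=False)
print(len(after))  # 0
```

All three rows share (p1, a, b) in columns #1, X, Y, so this call removes every row, including the row with index 2 that the question wants to keep. The comparison is made between rows, whereas the question asks about repeated groups within a single row.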
