How to compare multi-value duplicates in a DataFrame

Question

Inputs

I have a DataFrame with several columns, and a list of rules:

proof_path = 

   #1  X  Y  #2  Z  #3  W  #4
0  p1  a  b  p2  c  p2  a  p3
1  p1  a  b  p2  c  p3  a  p1
2  p1  a  b  p2  d  p3  e  p4

rule = [('#1', 'X', 'Y'), ('#2', 'X', 'Z'), ('#3', 'W', 'Z'), ('#4', 'W', 'Y')]
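For anyone who wants to reproduce the inputs, the frame and the rule list above can be built roughly as follows. This is only a sketch of the sample data as printed; the real `proof_path` may have been constructed differently.

```python
import pandas as pd

# Sample data copied from the printed frame above (column order preserved).
proof_path = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"],
        "X": ["a", "a", "a"],
        "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"],
        "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"],
        "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)

# Each tuple names the three columns that form one multi-value group.
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]
```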

In the above DataFrame, I want to check, for each row, whether any two of (#1, X, Y), (#2, X, Z), (#3, W, Z), and (#4, W, Y) hold the same values.

For example, in the row at index 0, (#2, X, Z) and (#3, W, Z) overlap: both are (p2, a, c).

In addition, in the row at index 1, (#1, X, Y) and (#4, W, Y) overlap: both are (p1, a, b). I want to drop the rows in which these multi-value groups overlap.
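To make the notion of "overlap" concrete, here is a small row-by-row sketch that lists which pairs of groups coincide in each row. The helper name `overlapping_pairs` is made up for illustration, and the frame is rebuilt from the sample data; it is meant for inspection, not speed.

```python
from itertools import combinations

import pandas as pd

proof_path = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]


def overlapping_pairs(row, rule):
    """Return the pairs of column groups whose values are identical in this row."""
    return [
        (x, y)
        for x, y in combinations(rule, 2)
        if tuple(row[list(x)]) == tuple(row[list(y)])
    ]


# Map each row index to the list of overlapping group pairs found in that row.
overlaps = {idx: overlapping_pairs(row, rule) for idx, row in proof_path.iterrows()}
```

Rows whose list is non-empty are the ones to drop.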

My desired output is:

Output

proof_path = 

   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4

I tried the following:

for depth in range(len(rule)-1):
    for i in range(1, len(rule)-depth):
        current_rComp = proof_path[[rule[depth][0], rule[depth][1], rule[depth][2]]]
        current_rComp.columns = ['pred', 'subj', 'obj']
        next_rComp = proof_path[[rule[i+depth][0], rule[i+depth][1], rule[i+depth][2]]]
        next_rComp.columns = ['pred', 'subj', 'obj']
        proof_path = proof_path[current_rComp.ne(next_rComp).any(axis=1)]

Although this approach produces the desired result, it is inefficient because it generates new DataFrames on every iteration. Is there a simpler way to accomplish this?

Answer 1

Create a placeholder mask that initially contains only False values. This mask will hold True for every row in which a duplicate is found.

Generate the length-two combinations of the rule list. For each combination, compare the corresponding slices of the DataFrame to create a boolean mask, reduce this mask with all along axis=1, and take the logical OR of the reduced mask with the placeholder mask.

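from itertools import combinations

import numpy as np

# df is the DataFrame from the question (named proof_path there)
mask = np.full(len(df), False)
for x, y in combinations(rule, 2):
    # True where the two column groups hold identical values in a row
    mask |= (df[[*x]].values == df[[*y]].values).all(1)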

Alternatively, the same approach can be wrapped in a list comprehension:

mask = np.any([(df[[*x]].values == df[[*y]].values).all(1) 
               for x, y in combinations(rule, 2)], axis=0)

>>> df[~mask]

   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4
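If the Python-level loop over pairs is itself a concern, one possible variation (not part of the answer above, just a sketch under the assumption that every rule has the same number of columns) is to stack the groups into a single array and compare all pairs at once with broadcasting. The names `triples`, `same`, and `upper` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

# Shape (n_rows, n_rules, 3): one value triple per rule per row.
triples = np.stack([df[list(r)].to_numpy() for r in rule], axis=1)

# same[r, i, j] is True when triples i and j are identical in row r.
same = (triples[:, :, None, :] == triples[:, None, :, :]).all(axis=-1)

# Only look at pairs with i < j so a triple is never compared with itself.
upper = np.triu(np.ones((len(rule), len(rule)), dtype=bool), k=1)
mask = (same & upper).any(axis=(1, 2))

result = df[~mask]
```

This trades the loop for an intermediate array of size n_rows × n_rules × n_rules × 3, so it is mainly attractive when the rule list is short.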

Answer 2

You can drop the rows that have duplicates on a subset of columns, like this:

df = df.drop_duplicates(subset=['#1', 'X', 'Y'], keep=False)
df = df.drop_duplicates(subset=['#2', 'X', 'Z'], keep=False)
df = df.drop_duplicates(subset=['#3', 'W', 'Z'], keep=False)

Refer to the documentation for additional parameters.
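One thing worth checking before relying on this: `drop_duplicates` compares each row against the other rows on the given columns, not column groups within a single row. A quick sketch on the question's sample data (the name `across_rows` is illustrative) shows the effect of the first call:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)

# keep=False drops every row whose ('#1', 'X', 'Y') values also occur in another row.
across_rows = df.drop_duplicates(subset=["#1", "X", "Y"], keep=False)
```

On this sample all three rows share the same (#1, X, Y) values, so `across_rows` comes out empty, which differs from the desired output in the question.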

Answer 1: score 2, accepted, 2021-05-31 09:33:16

Answer 2: score 0, 2021-05-31 09:00:40