How to compare multi-value duplicates in a DataFrame
Inputs
I have a DataFrame with several columns, and a list of column groups:
proof_path =
   #1  X  Y  #2  Z  #3  W  #4
0  p1  a  b  p2  c  p2  a  p3
1  p1  a  b  p2  c  p3  a  p1
2  p1  a  b  p2  d  p3  e  p4
rule = [('#1', 'X', 'Y'), ('#2', 'X', 'Z'), ('#3', 'W', 'Z'), ('#4', 'W', 'Y')]
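For anyone who wants to reproduce the inputs, the frame and the list can be rebuilt roughly as follows. This is a minimal sketch that copies the values from the table above; the string dtype of every column is an assumption, since the question does not state it.

```python
import pandas as pd

# Sample data copied from the question; all values are assumed to be strings.
proof_path = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"],
        "X": ["a", "a", "a"],
        "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"],
        "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"],
        "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)

# Each tuple names the three columns that form one group to compare.
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

print(proof_path)
```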
In the above DataFrame, I want to check, for each row, whether any two of the column groups (#1, X, Y), (#2, X, Z), (#3, W, Z), and (#4, W, Y) hold identical values.
For example, in the row with index 0, (#2, X, Z) and (#3, W, Z) overlap: both hold (p2, a, c).
In addition, in the row with index 1, (#1, X, Y) and (#4, W, Y) overlap: both hold (p1, a, b). I want to drop from the DataFrame every row in which any of these groups overlap.
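The overlaps described above can be listed row by row with a small pure-Python check. This is only an illustrative sketch: the helper name overlapping_groups is hypothetical, and the frame is rebuilt from the sample data so the snippet runs on its own.

```python
from itertools import combinations

import pandas as pd

proof_path = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]


def overlapping_groups(row, rule):
    """Return the pairs of column groups whose values are identical in this row.

    Hypothetical helper, written for illustration only.
    """
    return [
        (x, y)
        for x, y in combinations(rule, 2)
        if tuple(row[list(x)]) == tuple(row[list(y)])
    ]


# Map each row index to the list of overlapping group pairs found in that row.
overlaps = {idx: overlapping_groups(row, rule) for idx, row in proof_path.iterrows()}
for idx, pairs in overlaps.items():
    print(idx, pairs)
```

On the sample data this reports one overlapping pair for index 0, one for index 1, and none for index 2, which matches the rows I expect to drop and keep.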
My desired output is:
proof_path =
   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4
I tried the following:
for depth in range(len(rule) - 1):
    for i in range(1, len(rule) - depth):
        # Columns of the current group, renamed so that the two frames align
        current_rComp = proof_path[[rule[depth][0], rule[depth][1], rule[depth][2]]]
        current_rComp.columns = ['pred', 'subj', 'obj']
        # Columns of the group it is compared against
        next_rComp = proof_path[[rule[i+depth][0], rule[i+depth][1], rule[i+depth][2]]]
        next_rComp.columns = ['pred', 'subj', 'obj']
        # Keep only the rows where the two groups differ in at least one column
        proof_path = proof_path[current_rComp.ne(next_rComp).any(axis=1)]
Although this approach produces the desired result, it is inefficient because it creates new DataFrames on every iteration. Is there a simpler way to do this?
Create a placeholder mask initially filled with False values. This mask will hold True for every row in which a duplicate is found.
Generate all length-two combinations from the rule list. For each combination, compare the corresponding slices of the DataFrame to create a boolean mask, reduce that mask with all along axis=1, and take the logical OR of the reduced mask with the placeholder mask.
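from itertools import combinations
import numpy as np

# df is the DataFrame from the question (proof_path)
mask = np.full(len(df), False)
for x, y in combinations(rule, 2):
    # True where every column of group x equals the matching column of group y
    mask |= (df[[*x]].values == df[[*y]].values).all(1)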
Alternatively, we can wrap the above approach in a list comprehension:
mask = np.any([(df[[*x]].values == df[[*y]].values).all(1)
               for x, y in combinations(rule, 2)], axis=0)
>>> df[~mask]
   #1  X  Y  #2  Z  #3  W  #4
2  p1  a  b  p2  d  p3  e  p4
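For convenience, the mask approach can be packaged as a function and run end to end on the sample data. This is a sketch under a few assumptions: the function name drop_rows_with_duplicate_groups is hypothetical, the frame is rebuilt from the question's table, and every tuple in rule is assumed to have the same length so that the compared slices have matching shapes.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def drop_rows_with_duplicate_groups(df, rule):
    """Drop rows in which any two column groups from `rule` hold equal values.

    Hypothetical helper; assumes all tuples in `rule` have the same length.
    """
    mask = np.zeros(len(df), dtype=bool)
    for x, y in combinations(rule, 2):
        # Compare the two slices position by position, ignoring column labels.
        same = df[list(x)].to_numpy() == df[list(y)].to_numpy()
        # A row is a duplicate for this pair only if every column matches.
        mask |= same.all(axis=1)
    return df[~mask]


df = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)
rule = [("#1", "X", "Y"), ("#2", "X", "Z"), ("#3", "W", "Z"), ("#4", "W", "Y")]

result = drop_rows_with_duplicate_groups(df, rule)
print(result)
```

The loop still runs once per pair of groups, but each comparison is a single vectorized NumPy operation over all rows and the original frame is filtered only once at the end.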
You can drop the rows that have duplicates on a subset of columns like this:
df = df.drop_duplicates(subset=['#1', 'X', 'Y'], keep=False)
df = df.drop_duplicates(subset=['#2', 'X', 'Z'], keep=False)
df = df.drop_duplicates(subset=['#3', 'W', 'Z'], keep=False)
Refer to the documentation for additional parameters.
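A quick check on the sample data may help when weighing this approach. drop_duplicates compares rows against each other on the given subset of columns; it does not compare one group of columns against another within the same row. The sketch below rebuilds the question's frame as df (an assumption about what df refers to here) and applies two of the calls separately, so the effect of each can be seen on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "#1": ["p1", "p1", "p1"], "X": ["a", "a", "a"], "Y": ["b", "b", "b"],
        "#2": ["p2", "p2", "p2"], "Z": ["c", "c", "d"],
        "#3": ["p2", "p3", "p3"], "W": ["a", "a", "e"],
        "#4": ["p3", "p1", "p4"],
    }
)

# keep=False drops every row whose subset values also appear in another row.
# All three rows share (p1, a, b) in these columns, so nothing is left.
step1 = df.drop_duplicates(subset=["#1", "X", "Y"], keep=False)
print(len(step1))  # 0

# Rows 0 and 1 share (p2, a, c) in these columns, so only row 2 survives.
step2 = df.drop_duplicates(subset=["#2", "X", "Z"], keep=False)
print(step2.index.tolist())  # [2]
```

On this data, chaining the calls as written would therefore leave an empty frame after the first step, so the result differs from the desired output in the question.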