
Remove duplicates based on criteria in multiple columns across two data frames

I have 2 data frames that I need to compare to remove duplicates. DF1 has columns A, B, C, D, E, F, and DF2 has columns A, B, C, G, H, I. I want to get all rows from DF1 where either column A or B matches either column A or B from DF2 AND DF2 column G is not "Y".

So something along the lines of:

DF3 <- subset (DF1, (A | B %in% DF2$A | DF2$B) & (C %in% DF2$C) & (DF2$G != "Y"))

But I can't get the logical operators to work within the subset. Is there any way to accomplish this?

You can do this using an inner join with sqldf.

Example data. Please provide this yourself in the future.

df1 <- data.frame(a = 1:10, b = 1:10, c = 1:10, g = tail(letters, 10))
set.seed(2019)
df2 <- as.data.frame(lapply(df1, function(x) sample(x, replace = TRUE)))

Inner join and output:

library(sqldf)
sqldf("
select  a.*
from    df1 a
        join df2 b      
          on  (a.a = b.a or a.b = b.b)
              and a.c = b.c
where   b.g <> 'y'
")

#   a b c g
# 1 2 2 2 r
# 2 1 1 1 q
# 3 5 5 5 u
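
If you would rather avoid the sqldf dependency, the same filter can be sketched in base R. This is only a rough equivalent under a few assumptions: the small data frames below are made up for illustration (they are not the data from the question), the column names follow the lowercase a, b, c, g used in the example above, and each df1 row is returned at most once even when several df2 rows match it, whereas the SQL join would repeat it.

```r
# Hypothetical illustration data, not the data from the question
df1 <- data.frame(a = c(1, 2, 3, 4),
                  b = c(10, 20, 30, 40),
                  c = c(100, 200, 300, 400))
df2 <- data.frame(a = c(1, 9, 3, 8),
                  b = c(99, 20, 30, 40),
                  c = c(100, 200, 300, 999),
                  g = c("n", "y", "n", "n"),
                  stringsAsFactors = FALSE)

# Only df2 rows whose g is not "y" are allowed to produce a match
candidates <- df2[df2$g != "y", ]

# For each df1 row, test whether any candidate row agrees on (a or b) and on c
has_match <- vapply(seq_len(nrow(df1)), function(i) {
  any((candidates$a == df1$a[i] | candidates$b == df1$b[i]) &
        candidates$c == df1$c[i])
}, logical(1))

# Keep the df1 rows that have at least one qualifying partner in df2
df3 <- df1[has_match, ]
```

The comparison is done row by row because `%in%` on its own only checks each column against the whole of df2, so it cannot guarantee that the a/b match, the c match, and the g condition all come from the same df2 row.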
