Find rows that share a value in one column of a data frame but differ in another column
In R, I have a data frame that includes an ID column. I need to find all the rows that have the same ID but differ in the X1 variable.
For example:
d
ID X1 X2
a 19 F
b 19 F
c 16 T
a 16 T
a 19 T
d 17 T
b 15 F
b 19 F
c 17 T
c 17 T
d 17 T
e 15 T
f 14 T
g 16 T
The result should be:
df1
ID X1 X2
a 19 F
b 19 F
c 16 T
a 16 T
b 15 F
c 17 T
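# Cross-tabulate X1 values (rows) against IDs (columns)
tab <- table(d$X1, d$ID)
# Cap each count at 1 so the column sums count distinct X1 values per ID
tab[tab > 1] <- 1
n_x1 <- apply(tab, 2, sum)
# Keep only the IDs with more than one distinct X1
n_x1 <- n_x1[n_x1 > 1]
d1 <- data.frame(ID = names(n_x1))
# Pull the matching rows from d, then drop repeated (ID, X1) pairs
d1 <- merge(d1, d, by = "ID", all.x = TRUE, all.y = FALSE)
d1 <- unique(d1[, 1:2])
d1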
  ID X1
1  a 19
2  a 16
4  b 15
5  b 19
7  c 16
8  c 17
We can include the third column as well, but you would need to supply a rule for choosing which of its values to retain. For instance, there were two rows for ID a where X1 was 19, one where X2 was T and one where it was F. To choose between the two, you could keep the first matching row, keep the last, prefer T over F, and so on.
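The "keep the first matching row" rule mentioned above can be sketched in base R as follows. This is only one possible reading of the rule, not the answerer's own code: the data frame is rebuilt from the values in the question, with X2 assumed to be logical, and the names n_distinct_x1, keep, and result are introduced here for illustration.

```r
# Rebuild the question's example data (values copied from the post;
# storing X2 as logical is an assumption).
d <- data.frame(
  ID = c("a", "b", "c", "a", "a", "d", "b", "b", "c", "c", "d", "e", "f", "g"),
  X1 = c(19, 19, 16, 16, 19, 17, 15, 19, 17, 17, 17, 15, 14, 16),
  X2 = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
         FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Number of distinct X1 values within each ID, repeated on every row.
n_distinct_x1 <- ave(d$X1, d$ID, FUN = function(x) length(unique(x)))

# Keep IDs with more than one distinct X1, and within those keep only the
# first row of each (ID, X1) pair, so X2 comes from the first matching row.
keep <- n_distinct_x1 > 1 & !duplicated(d[, c("ID", "X1")])
result <- d[keep, ]
result
```

To keep the last matching row instead, duplicated() accepts fromLast = TRUE; preferring T over F would need the rows sorted by X2 before the duplicated() step.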
We can remove the IDs that appear only once first, then count the rows left for each ID. If an ID is left with a single row, we remove it too (df1 here holds the original data, called d in the question):
newdf <- df1[duplicated(df1$ID, fromLast=TRUE),]
tbl <- table(newdf$ID)
newdf[!newdf$ID %in% names(tbl[tbl < 2]),]
# ID X1 X2
# 1 a 19 FALSE
# 2 b 19 FALSE
# 3 c 16 TRUE
# 4 a 16 TRUE
# 7 b 15 FALSE
# 9 c 17 TRUE
Would this work?
df1[rownames(unique(df1[,c("ID","X1")])),]
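One way to answer that is to run the one-liner on the example data and compare the IDs it returns with those in the expected result. The sketch below assumes df1 holds the original 14 rows from the question (with X2 stored as logical); candidate, expected_ids, and extra_ids are names introduced here for illustration.

```r
# Rebuild the question's example data (values copied from the post;
# storing X2 as logical is an assumption).
df1 <- data.frame(
  ID = c("a", "b", "c", "a", "a", "d", "b", "b", "c", "c", "d", "e", "f", "g"),
  X1 = c(19, 19, 16, 16, 19, 17, 15, 19, 17, 17, 17, 15, 14, 16),
  X2 = c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
         FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# The one-liner: first row of every distinct (ID, X1) pair.
candidate <- df1[rownames(unique(df1[, c("ID", "X1")])), ]

# IDs that appear in the question's expected result.
expected_ids <- c("a", "b", "c")

# IDs returned by the one-liner that the expected result does not contain.
extra_ids <- setdiff(unique(candidate$ID), expected_ids)
extra_ids
```

On this data extra_ids is not empty: unique() only drops repeated (ID, X1) pairs, so IDs with a single X1 value are still returned and would need a separate filter.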