
Remove the rows that have the same column A value but different column B value from df (but not vice-versa) in R

I'm trying to remove all rows that share a value in the "lan" column of my data frame but have different values in the "id" column (but not vice versa).

Using an example dataset:

library(dplyr)
t <- structure(list(id = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 
                           4L), lan = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 1L, 
                                                  7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), 
                    value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913, 
                              0.304183372, 0.983348167, 0.356128559, 0.054102854, 0.400934593, 
                              0.001026817, 0.488452667)), .Names = c("id", "lan", "value"
                              ), class = "data.frame", row.names = c(NA, -11L))
t

I need to get rid of rows 1 and 10 because they have the same lan (a) but different id.

I've tried the following, without success:

a<-t[(!duplicated(t$id)),]
c<-a[duplicated(a$lan)|duplicated(a$lan, fromLast=TRUE),]
d<-t[!(t$lan %in% c$lan),]

Thanks for your help!

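An alternative using dplyr: collapse the data to one row per (lan, id) pair, count how many ids each lan occurs with, and collect the lan values that occur with more than one id. Rows with those lan values are then dropped.

t2 <- t %>% 
  # one row per (lan, id) pair; the summed value is only a by-product
  group_by(lan,id) %>%
  summarise(value=sum(value)) %>%
  # number of distinct ids observed for each lan
  group_by(lan) %>%
  summarise(number=n()) %>%
  # keep only the lan values shared by more than one id
  filter(number>1) %>%
  select(lan)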

> t[!t$lan %in% t2$lan ,]
   id lan      value
2   2   b 0.84898983
3   2   c 0.53806582
4   3   d 0.91657191
5   3   d 0.30418337
6   4   e 0.98334817
7   4   e 0.35612856
8   4   e 0.05410285
9   4   f 0.40093459
11  4   g 0.48845267
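
The same idea can probably be written more compactly with dplyr's n_distinct(), which counts the distinct ids inside each lan group directly. This is only a sketch, not part of the answer above: the names df and kept are made up for illustration, and the data is rebuilt with lan as a plain character column rather than a factor.

```r
library(dplyr)

# Example data from the question, rebuilt under the hypothetical name df
# (lan is a character column here instead of a factor).
df <- data.frame(
  id    = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L),
  lan   = c("a", "b", "c", "d", "d", "e", "e", "e", "f", "a", "g"),
  value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
            0.304183372, 0.983348167, 0.356128559, 0.054102854,
            0.400934593, 0.001026817, 0.488452667)
)

# Keep only the lan groups in which exactly one distinct id occurs.
kept <- df %>%
  group_by(lan) %>%
  filter(n_distinct(id) == 1) %>%
  ungroup()
```

Here kept should hold the nine rows listed above, with lan "a" gone because it occurs under both id 1 and id 4.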

You could use duplicated on "lan" to get a logical index of every row whose lan occurs more than once, then do the same on both columns together ('id', 'lan') and negate it to flag the rows whose (id, lan) pair occurs only once. A row that is TRUE in both indices shares its lan with another row but not its id, so negate the combined index and subset.

indx1 <-  with(t, duplicated(lan)|duplicated(lan,fromLast=TRUE))
indx2 <- !(duplicated(t[1:2])|duplicated(t[1:2],fromLast=TRUE))
t[!(indx1 & indx2),]
#   id lan      value
#2   2   b 0.84898983
#3   2   c 0.53806582
#4   3   d 0.91657191
#5   3   d 0.30418337
#6   4   e 0.98334817
#7   4   e 0.35612856
#8   4   e 0.05410285
#9   4   f 0.40093459
#11  4   g 0.48845267
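
A base R variant of the same filter, offered only as a sketch: it counts the distinct ids per lan with ave() and keeps the rows where that count is 1. The names df, n_ids and res are hypothetical, and the data is rebuilt with lan as a character column. One possible reason to prefer it: the duplicated() approach above keeps a row whenever its (id, lan) pair is repeated, even if the same lan also appears under another id, whereas the count per lan does not depend on repeats.

```r
# Example data from the question, rebuilt under the hypothetical name df
# (this also avoids masking base::t).
df <- data.frame(
  id    = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L),
  lan   = c("a", "b", "c", "d", "d", "e", "e", "e", "f", "a", "g"),
  value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
            0.304183372, 0.983348167, 0.356128559, 0.054102854,
            0.400934593, 0.001026817, 0.488452667)
)

# For every row, the number of distinct ids that share its lan.
n_ids <- ave(df$id, df$lan, FUN = function(x) length(unique(x)))

# Keep the rows whose lan belongs to a single id.
res <- df[n_ids == 1, ]
```

On the example data, res should contain original rows 2 to 9 and 11, the same rows as the output above.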
