
Removing duplicates in R based on condition

I need to embed a condition in a remove-duplicates function. I am working with a large student database from South Africa, a highly multilingual country. Last week you gave me the code to remove duplicates caused by retakes, but I now realise my language exam data shows some students offering more than 2 different languages. The source data, simplified, looks like this:

STUDID   MATSUBJ     SCORE
101      AFRIKAANSB   1
101      AFRIKAANSB   4
102      ENGLISHB     2
102      ISIZULUB     7
102      ENGLISHB     5

The result file I need is:

STUDID   MATSUBJ    SCORE  flagextra
101      AFRIKAANS   4
102      ENGLISH     5
102      ISIZULUB    7     1

I need to flag the extra language so that I can see which languages they are and make a new category for them.

Maybe this helps:

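library(tidyverse)
df1 %>% 
   group_by(STUDID, MATSUBJ) %>%   # one group per student/subject pair
   summarise(SCORE = max(SCORE),   # keep the best score across retakes
             # 1 when the pair occurs only once (no retake), otherwise 0
             flagextra = as.integer(!sum(duplicated(MATSUBJ))))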
# A tibble: 3 x 4
# Groups:   STUDID [?]
#  STUDID MATSUBJ    SCORE flagextra
#   <int> <chr>      <dbl>     <int>
#1    101 AFRIKAANSB     4         0
#2    102 ENGLISHB       5         0
#3    102 ISIZULUB       7         1
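
Note that `flagextra` above is 1 whenever a student/subject pair has only one row, so a subject taken once without a retake is flagged even when it is the student's only language. If the goal is instead to flag every language beyond the first per student, a minimal sketch could look like the following. `n_allowed` and `res` are hypothetical names, not from the question. Which language counts as "extra" is assumed here to be whichever sorts later alphabetically, because `summarise()` returns the groups in sorted order. The `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical parameter: how many languages per student are NOT flagged
n_allowed <- 1L

res <- df1 %>%
  group_by(STUDID, MATSUBJ) %>%
  summarise(SCORE = max(SCORE), .groups = "drop") %>%  # collapse retakes
  group_by(STUDID) %>%
  # flag every subject after the first n_allowed within each student
  mutate(flagextra = as.integer(row_number() > n_allowed)) %>%
  ungroup()

res
```

On the example data this flags only ISIZULUB for student 102. With real data you would probably want to sort by a meaningful column before the `mutate()` step rather than rely on alphabetical order.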

Or with base R:

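# TRUE for rows whose STUDID/MATSUBJ pair occurs exactly once in df1
i1 <- !(duplicated(df1[1:2]) | duplicated(df1[1:2], fromLast = TRUE))
# best score per STUDID/MATSUBJ pair; flag subjects that appear among those rows
transform(aggregate(SCORE ~ ., df1, max), 
          flagextra = as.integer(MATSUBJ %in% df1$MATSUBJ[i1]))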
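
The `i1` line relies on combining `duplicated()` run from both ends, which may not be obvious at first. A small illustration on a made-up vector (`x` and `unique_only` are hypothetical names used only for this sketch):

```r
x <- c("a", "b", "a", "c")

duplicated(x)                   # FALSE FALSE  TRUE FALSE  (marks the 2nd "a")
duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE FALSE  (marks the 1st "a")

# Negating the OR of both leaves TRUE only for values that occur exactly once
unique_only <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
unique_only                     # FALSE  TRUE FALSE  TRUE
```

Applied to `df1[1:2]`, the same idea picks out the student/subject pairs that have no retake.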

Data:

df1 <- structure(list(STUDID = c(101L, 101L, 102L, 102L, 102L),
                      MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB",
                                  "ISIZULUB", "ENGLISHB"),
                      SCORE = c(1L, 4L, 2L, 7L, 5L)),
                 class = "data.frame", row.names = c(NA, -5L))

A two-stage procedure works better for me as a newbie to R:

library(dplyr)
# Remove the duplicates caused by subject retakes (keep the top score per student/subject)
LANGSEC <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  top_n(1, SUBJSCORE) %>%
  ungroup()
# Then flag one of the two subjects causing the remaining duplicates
LANGSEC$flagextra <- as.integer(duplicated(LANGSEC$STUDID))
# Then filter for this third language and make a new file
LANG3 <- LANGSEC %>% filter(flagextra == 1)
# Then remove these from the other file
LANG2 <- LANGSEC %>% filter(flagextra != 1)
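
One caveat with the first stage: `top_n()` keeps ties, so a student who scored the same on both attempts still has two rows for that subject, and the second one would then be flagged as an extra language by `duplicated(STUDID)`. A hedged sketch of the difference, using a made-up student 201 (not from the question's data) and `slice_max()` with `with_ties = FALSE`, which assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical sample: student 201 retook ENGLISHB and scored 5 both times
LANGSEC <- data.frame(
  STUDID     = c(201L, 201L, 201L),
  MATRICSUBJ = c("ENGLISHB", "ENGLISHB", "ISIZULUB"),
  SUBJSCORE  = c(5L, 5L, 6L),
  stringsAsFactors = FALSE
)

# top_n() keeps both tied ENGLISHB rows
with_ties <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  top_n(1, SUBJSCORE) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per group
no_ties <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  slice_max(SUBJSCORE, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties)  # 3
nrow(no_ties)    # 2
```

Whether tied retakes actually occur in the real database is not stated, so this may or may not matter in practice.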
