Removing duplicates in R based on condition
I need to embed a condition in a remove-duplicates function. I am working with a large student database from South Africa, a highly multilingual country. Last week you gave me the code to remove duplicates caused by retakes, but I now realise that my language exam data shows some students offering more than 2 different languages. The source data, simplified, looks like this:
STUDID MATSUBJ SCORE
101 AFRIKAANSB 1
101 AFRIKAANSB 4
102 ENGLISHB 2
102 ISIZULUB 7
102 ENGLISHB 5
The result file I need is:
STUDID  MATSUBJ     SCORE  flagextra
101     AFRIKAANSB  4
102     ENGLISHB    5
102     ISIZULUB    7      1
I need to flag the extra language so that I can see which languages they are and make a new category for them.
Maybe this helps:
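library(tidyverse)

df1 %>%
  group_by(STUDID, MATSUBJ) %>%
  # keep the best score per student and language; a language with only one
  # row for that student (no retake) gets flagextra = 1
  summarise(SCORE = max(SCORE),
            flagextra = as.integer(!sum(duplicated(MATSUBJ))))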
# A tibble: 3 x 4
# Groups: STUDID [?]
# STUDID MATSUBJ SCORE flagextra
# <int> <chr> <dbl> <int>
#1 101 AFRIKAANSB 4 0
#2 102 ENGLISHB 5 0
#3 102 ISIZULUB 7 1
Or with base R:
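# rows whose (STUDID, MATSUBJ) pair occurs only once, i.e. no retake
i1 <- !(duplicated(df1[1:2]) | duplicated(df1[1:2], fromLast = TRUE))
# flag on STUDID and MATSUBJ together so students do not affect each other
transform(aggregate(SCORE ~ ., df1, max),
          flagextra = as.integer(paste(STUDID, MATSUBJ) %in%
                                   paste(df1$STUDID, df1$MATSUBJ)[i1]))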
# sample data (df1) used in both snippets
df1 <- structure(list(STUDID = c(101L, 101L, 102L, 102L, 102L),
                      MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB",
                                  "ISIZULUB", "ENGLISHB"),
                      SCORE = c(1L, 4L, 2L, 7L, 5L)),
                 class = "data.frame", row.names = c(NA, -5L))
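Both snippets above treat a language as extra when the student sat it exactly once. On the sample data that gives the requested result, but a student who sat only one language, once, would presumably be flagged as well. A minimal base R sketch of a stricter rule follows, assuming that "extra" means a language sat once by a student who has more than one language on record. The student 103 row and the names `df2`, `SITTINGS`, `best` and `n_lang` are hypothetical and only serve the illustration.

```r
# Sample data from the question plus one HYPOTHETICAL student (103)
# who sat a single language exactly once.
df2 <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L, 103L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB",
              "ISIZULUB", "ENGLISHB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L, 6L),
  stringsAsFactors = FALSE
)

# Number of sittings per student and language, repeated on every row.
df2$SITTINGS <- ave(df2$SCORE, df2$STUDID, df2$MATSUBJ, FUN = length)

# Collapse retakes: best score (and the sitting count) per student and language.
best <- aggregate(cbind(SCORE, SITTINGS) ~ STUDID + MATSUBJ,
                  data = df2, FUN = max)

# Number of distinct languages each student has after collapsing retakes.
n_lang <- ave(best$SCORE, best$STUDID, FUN = length)

# Assumed rule: extra = sat once AND the student has more than one language.
best$flagextra <- as.integer(n_lang > 1 & best$SITTINGS == 1)

# Tidy up: sort by student, then language, and drop the helper column.
best <- best[order(best$STUDID, best$MATSUBJ),
             c("STUDID", "MATSUBJ", "SCORE", "flagextra")]
print(best)
```

Whether a single sitting is really the right signal for an extra language depends on the data; if some students retake their extra language too, a different rule would be needed.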
A two-stage procedure works better for me as a newbie to R:
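The code for this answer is not shown here, so the following is only one possible reading of a two-stage procedure in base R, not the poster's own code. Stage 1 collapses retakes to the best score. Stage 2 counts sittings and flags languages sat only once, the same rule the answers above use. The names `stage1`, `counts`, `stage2` and `SITTINGS` are made up for this sketch.

```r
# Sample data from the question.
df1 <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L),
  stringsAsFactors = FALSE
)

# Stage 1: one row per student and language, keeping the best score.
stage1 <- aggregate(SCORE ~ STUDID + MATSUBJ, data = df1, FUN = max)

# Stage 2a: count how many times each student sat each language.
counts <- aggregate(SCORE ~ STUDID + MATSUBJ, data = df1, FUN = length)
names(counts)[names(counts) == "SCORE"] <- "SITTINGS"

# Stage 2b: attach the counts and flag languages sat exactly once.
stage2 <- merge(stage1, counts, by = c("STUDID", "MATSUBJ"))
stage2$flagextra <- as.integer(stage2$SITTINGS == 1)
stage2 <- stage2[order(stage2$STUDID, stage2$MATSUBJ), ]
print(stage2)

# Which languages were flagged, and how often.
print(table(stage2$MATSUBJ[stage2$flagextra == 1]))
```

Keeping the intermediate tables makes it easy to inspect each step, which may be the appeal of doing it in two stages.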