Removing duplicates from DataFrame in R
I have this data:
UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017
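I want to remove the duplicate rows.
The rule for a duplicate row is:
Among rows whose Quiz_answers end in the same outcome (Positive, Negative or Neutral), a row is a duplicate if its UserID and Quiz_Date values are also the same.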
UserID <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Quiz_answers <- c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                  "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                  "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative")
Quiz_Date <- as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017", "30-11-2017",
                       "28-02-2018", "27-01-2017", "28-08-2017", "28-01-2017"), "%d-%m-%Y")
data <- data.frame(UserID, Quiz_answers, Quiz_Date)
I wrote the following code:
data.removeDuplicates<- function(frames)
{
apply(frames[ ,c(grep("UserID", colnames(data)),grep("Quiz_answers", colnames(data)),grep("Quiz_Date", colnames(data)))],1,function(slice){
Outcome<-paste0("`",tail(strsplit(slice[2],split="`")[[1]],1))
cat("\n\n Searching for records: ",slice[1],Outcome,slice[3])
data<<-data[!( data$UserID == slice[1] & paste0("`",sapply(strsplit(as.character(data[,2]),'`'), tail, 1)) == c(Outcome) & data[,3]==c(slice[3])), ]
})
print(frames)
}
data.removeDuplicates(data)
print(data)
[1] UserID Quiz_answers Quiz_Date
<0 rows> (or 0-length row.names)
I expect this output:
UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017
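According to the rule, only the second row should be removed from the DataFrame, since it is the only row that meets the duplicate condition. What am I doing wrong?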
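One possible reading of why the function leaves zero rows, offered as a sketch rather than a definitive diagnosis: the filter inside apply() drops every row whose UserID, outcome and date match the current row, and the current row always matches itself. Each pass therefore removes the first occurrence together with its duplicates. The snippet below illustrates this on a three-row subset of the question's data; the names toy, key, matches and keep_wrong are hypothetical and do not appear in the original code.

```r
# Three rows taken from the question's data (toy is a hypothetical name).
toy <- data.frame(
  UserID = c(1, 1, 1),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Outcome label = everything after the last backtick.
outcome <- sub(".*`", "", toy$Quiz_answers)

# One string per row combining the three parts of the duplicate rule.
key <- paste(toy$UserID, outcome, toy$Quiz_Date)

# For each row, count the rows sharing its key (the row itself included).
matches <- vapply(key, function(k) sum(key == k), integer(1), USE.NAMES = FALSE)

# "Drop every row that matches the current one" keeps only rows with zero
# matches, and no such row exists because each row matches itself.
keep_wrong <- matches == 0
print(matches)
print(toy[keep_wrong, ])
```

Since every count is at least 1, nothing survives the filter, which is consistent with the empty result shown in the question.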
Try this.
Your data
df <- read.table(text="UserID Quiz_answers Quiz_Date
1 `a1,a2,a3`Positive 26-01-2017
1 `a1,a4,a3`Positive 26-01-2017
1 `a1,a2,a4`Negative 28-02-2017
1 `a1,a2,a3`Neutral 30-10-2017
1 `a1,a2,a4`Positive 30-11-2017
1 `a1,a2,a4`Negative 28-02-2018
2 `a1,a2,a3`Negative 27-01-2017
2 `a1,a7,a3`Neutral 28-08-2017
2 `a1,a2,a5`Negative 28-01-2017", header = TRUE, stringsAsFactors=FALSE)
Solution and output
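library(dplyr)
ans <- df %>%
  # grp = outcome label after the last backtick (Positive, Negative, Neutral)
  mutate(grp = sub(".*`(\\D+)$", "\\1", Quiz_answers)) %>%
  group_by(grp, UserID, Quiz_Date) %>%
  slice(1) %>%   # keep the first row of each group
  ungroup() %>%
  select(-grp) %>%
  arrange(UserID, Quiz_Date)
ans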
# A tibble: 8 x 3
# UserID Quiz_answers Quiz_Date
# <int> <chr> <chr>
# 1 1 `a1,a2,a3`Positive 26-01-2017
# 2 1 `a1,a2,a4`Negative 28-02-2017
# 3 1 `a1,a2,a4`Negative 28-02-2018
# 4 1 `a1,a2,a3`Neutral 30-10-2017
# 5 1 `a1,a2,a4`Positive 30-11-2017
# 6 2 `a1,a2,a3`Negative 27-01-2017
# 7 2 `a1,a2,a5`Negative 28-01-2017
# 8 2 `a1,a7,a3`Neutral 28-08-2017
You can use the sqldf package as follows. First, find the Positive, Negative and Neutral rows. Then use GROUP BY to filter out the duplicates:
require("sqldf")
result <- sqldf("SELECT * FROM df WHERE Quiz_answers LIKE '%`Positive' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Negative' GROUP BY UserID, Quiz_Date
UNION
SELECT * FROM df WHERE Quiz_answers LIKE '%`Neutral' GROUP BY UserID, Quiz_Date")
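After running it, result is shown below. Note that for UserID 1 on 26-01-2017 it keeps `a1,a4,a3`Positive rather than `a1,a2,a3`Positive, because the query does not specify which row of each group is returned: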
UserID Quiz_answers Quiz_Date
1 1 `a1,a2,a3`Neutral 30-10-2017
2 1 `a1,a2,a4`Negative 28-02-2017
3 1 `a1,a2,a4`Negative 28-02-2018
4 1 `a1,a2,a4`Positive 30-11-2017
5 1 `a1,a4,a3`Positive 26-01-2017
6 2 `a1,a2,a3`Negative 27-01-2017
7 2 `a1,a2,a5`Negative 28-01-2017
8 2 `a1,a7,a3`Neutral 28-08-2017
Here is a two-line solution using only base R:
data[,"group"] <- with(data, sub(".*`", "", Quiz_answers))
data <- data[as.integer(rownames(unique(data[, !(names(data) %in% "Quiz_answers") ]))), !(names(data) %in% "group")]
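The indexing above works because the row names are the default 1..n, so converting them to integers gives valid row positions. A variant that does not depend on row names can be sketched with duplicated(), which flags every row whose key has already appeared earlier. This is a minimal sketch under the question's rule (same UserID, same outcome, same Quiz_Date), not the answer author's method; the names outcome, key, is_dup and deduped are assumptions made for the illustration.

```r
# Rebuild the question's data (dates kept as Date objects).
data <- data.frame(
  UserID = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Quiz_answers = c("`a1,a2,a3`Positive", "`a1,a4,a3`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Neutral", "`a1,a2,a4`Positive", "`a1,a2,a4`Negative",
                   "`a1,a2,a3`Negative", "`a1,a7,a3`Neutral", "`a1,a2,a5`Negative"),
  Quiz_Date = as.Date(c("26-01-2017", "26-01-2017", "28-02-2017", "30-10-2017",
                        "30-11-2017", "28-02-2018", "27-01-2017", "28-08-2017",
                        "28-01-2017"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Outcome label = everything after the last backtick.
outcome <- sub(".*`", "", data$Quiz_answers)

# Key columns that define a duplicate under the question's rule.
key <- data.frame(UserID = data$UserID, outcome = outcome, Quiz_Date = data$Quiz_Date)

# duplicated() is TRUE only for the second and later occurrences of a key,
# so the first occurrence of each key is kept.
is_dup <- duplicated(key)
deduped <- data[!is_dup, ]
print(deduped)
```

Because the helper column lives in a separate key data frame, the original data needs no temporary group column and nothing has to be dropped afterwards.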