Identify and remove duplicates by a criterion in R
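Hi, I'm confused about how to handle duplicates in R. I've looked around and can't seem to find anything that helps. I have a dataset like this: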
x <- data.frame(
  id = c("A","A","A","A","A","A","A","B","B","B","B"),
  StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
                "08/10/2006", "09/04/2007", "02/03/2011", "05/05/2005",
                "08/06/2009", "07/09/2009", "07/09/2009"),
  EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006",
              "19/11/2006", "07/05/2007", "30/03/2011", "02/06/2005",
              "06/07/2009", "05/10/2009", "05/10/2009"),
  Group = c(1,1,1,2,2,3,4,2,3,4,4),
  TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006",
               "08/10/2006", "NA", "02/03/2011", "NA",
               "07/09/2009", "07/09/2009", "08/10/2009"),
  Code = c(4,4,4858,4,4858,NA,4,NA,795,795,4)
)
> x
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
2   A 09/07/2006 06/08/2006     1 08/09/2006    4
3   A 09/07/2006 06/08/2006     1 08/10/2006 4858
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
9   B 08/06/2009 06/07/2009     3 07/09/2009  795
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4
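So basically, what I want to do is identify duplicates in the TestDate variable within each id. For example, the dates 08/09/2006 and 08/10/2006 are repeated for the same individual but in different groups, and I don't want the same TestDate to appear in different groups for the same id. The criterion for choosing which TestDate to keep is to take the difference in days between the TestDate and the StartDate and EndDate of each group, and retain the row with the smallest difference. For example, for the date 08/10/2006 I would like to keep row 5, because its TestDate is closer to its StartDate than is the case in row 3. Ultimately, I would like to obtain a dataset like this: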
> xfinal
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4
Any help would be greatly appreciated. Thanks.
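To make the selection rule concrete, the gap in days can be worked out by hand for the two rows that share TestDate 08/10/2006 (rows 3 and 5). This is only a minimal sketch: the names test, start and gap are illustrative and not from the original post, and it assumes "closest" means the smallest absolute difference between TestDate and StartDate.

```r
# Rows 3 and 5 share id "A" and TestDate 08/10/2006 but sit in different groups.
test  <- as.Date("08/10/2006", format = "%d/%m/%Y")
start <- as.Date(c("09/07/2006", "08/10/2006"), format = "%d/%m/%Y")

# Absolute gap in days between the TestDate and each candidate StartDate
gap <- abs(as.numeric(difftime(test, start, units = "days")))
names(gap) <- c("row3", "row5")
gap                          # row3 = 91, row5 = 0

# The row with the smallest gap is the one to keep
names(gap)[which.min(gap)]   # "row5"
```

Row 5 has a gap of 0 days against 91 days for row 3, which is why row 5 is the one retained in xfinal.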
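# Parse the dd/mm/yyyy strings as Dates; the "NA" strings become NA
x$StartDate <- as.Date(x$StartDate, format = "%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate, format = "%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate, format = "%d/%m/%Y")
# Absolute gap in days between TestDate and StartDate (the criterion in the question's example)
x$Diff <- abs(as.numeric(difftime(x$TestDate, x$StartDate, units = "days")))
# Sort so the smallest gap comes first within each id (NA gaps go last)
x <- x[order(x$id, x$Diff), ]
# Keep the first, i.e. closest, row for each id/TestDate pair
x <- x[!duplicated(x[, c("id", "TestDate")]), ]
# Restore the original row order and drop the helper column
x <- x[order(as.integer(rownames(x))), ]
x$Diff <- NULL
x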
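As a self-contained sketch of the rule stated in the question (for each id and TestDate, keep the row whose TestDate is closest to its StartDate), the selection can also be done without re-sorting the data frame. The helper d and the names gap, key and keep are hypothetical, and measuring the gap against StartDate only is an assumption based on the question's row 5 example.

```r
# Copy of the question's data, with dates parsed up front.
d <- function(s) as.Date(s, format = "%d/%m/%Y")  # hypothetical helper
x <- data.frame(
  id        = c("A","A","A","A","A","A","A","B","B","B","B"),
  StartDate = d(c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
                  "08/10/2006", "09/04/2007", "02/03/2011", "05/05/2005",
                  "08/06/2009", "07/09/2009", "07/09/2009")),
  EndDate   = d(c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006",
                  "19/11/2006", "07/05/2007", "30/03/2011", "02/06/2005",
                  "06/07/2009", "05/10/2009", "05/10/2009")),
  Group     = c(1,1,1,2,2,3,4,2,3,4,4),
  TestDate  = d(c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006",
                  "08/10/2006", NA, "02/03/2011", NA,
                  "07/09/2009", "07/09/2009", "08/10/2009")),
  Code      = c(4,4,4858,4,4858,NA,4,NA,795,795,4)
)

# Assumption: "closest" = smallest absolute gap between TestDate and StartDate.
gap <- abs(as.numeric(difftime(x$TestDate, x$StartDate, units = "days")))

# One key per id/TestDate pair; a missing TestDate becomes the string "NA".
key <- paste(x$id, x$TestDate)

# For each key, pick the row index with the smallest gap (ties keep the earlier row).
keep <- vapply(split(seq_len(nrow(x)), key),
               function(i) i[order(gap[i])[1]],
               integer(1))

# Subset in the original row order.
xfinal <- x[sort(keep), ]
rownames(xfinal)  # expected: rows 1, 4, 5, 6, 7, 8, 10, 11, as in the question
```

Note that rows with a missing TestDate are grouped together within an id, so if one id had several such rows only one would survive; in this data each id has at most one.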