
Identify and remove duplicates by a criterion in R

Hi, I am puzzled by a problem concerning duplicates in R. I have looked around a lot and can't seem to find any help. I have a dataset like this:

x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
                StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006", 
                              "08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
                EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
                            "02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
                Group = c(1,1,1,2,2,3,4,2,3,4,4),
                TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
                              "NA", "07/09/2009", "07/09/2009", "08/10/2009"),
                Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
              )

> x
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
2   A 09/07/2006 06/08/2006     1 08/09/2006    4
3   A 09/07/2006 06/08/2006     1 08/10/2006 4858
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
9   B 08/06/2009 06/07/2009     3 07/09/2009  795
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

So basically what I am trying to do is identify duplicates in the TestDate variable by id. For example, the dates 08/09/2006 and 08/10/2006 are repeated for the same person (id A) but in different Groups, and I don't want the same TestDate to appear in more than one Group for a given id. To decide which row to keep, I take the difference in days between TestDate and the StartDate and EndDate of each Group, and keep the row with the smallest difference. For example, for the date 08/10/2006 I would like to keep row 5, because the TestDate there is closer to the StartDate than it is in row 3. Eventually, I would like to end up with a dataset like this:

> xfinal
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

Any help with this would be much appreciated. Thanks.
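The "identify" half of the problem can be sketched on its own before anything is removed. The snippet below is a minimal illustration, assuming a cut-down copy of the question's data (only id, Group and TestDate, with real NA values instead of the "NA" strings). The names key, n_groups and conflict are made up for this sketch.

```r
# Cut-down copy of the question's data: only the columns needed here
x <- data.frame(
  id       = c("A","A","A","A","A","A","A","B","B","B","B"),
  Group    = c(1,1,1,2,2,3,4,2,3,4,4),
  TestDate = c("09/06/2006","08/09/2006","08/10/2006","08/09/2006","08/10/2006",
               NA,"02/03/2011",NA,"07/09/2009","07/09/2009","08/10/2009"),
  stringsAsFactors = FALSE
)

# One key per id/TestDate pair
key <- paste(x$id, x$TestDate)

# Count how many distinct Groups share each id/TestDate pair
n_groups <- ave(x$Group, key, FUN = function(g) length(unique(g)))

# A row is in conflict when its TestDate is known and sits in more than one Group
conflict <- !is.na(x$TestDate) & n_groups > 1

# Show the rows that still need a decision
x[conflict, ]
```

Rows flagged this way are the only ones where the "closest date" rule has to choose. Rows with a missing TestDate are deliberately left unflagged.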

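# Parse the dd/mm/yyyy strings as Dates; the "NA" strings become NA
x$StartDate <- as.Date(x$StartDate, format = "%d/%m/%Y")
x$EndDate   <- as.Date(x$EndDate,   format = "%d/%m/%Y")
x$TestDate  <- as.Date(x$TestDate,  format = "%d/%m/%Y")

# Days between each TestDate and the StartDate of its own row
# (units must be named: the third positional argument of difftime() is tz)
x$Diff <- abs(as.numeric(difftime(x$TestDate, x$StartDate, units = "days")))

# Within each id, put the rows with the closest TestDate first
x <- x[order(x$id, x$Diff), ]

# Keep only the first (closest) row for each id/TestDate pair
x <- x[!duplicated(x[, c("id", "TestDate")]), ]
x$Diff <- NULL

# Restore the original row order
x <- x[order(as.integer(rownames(x))), ]
x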
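The question's criterion can also be read as the distance from TestDate to the nearest of StartDate and EndDate. Here is a minimal sketch under that assumption, using base R's ave() to rank rows within each id/TestDate pair so that the original row order is never disturbed. The helper to_date and the names dist, key, rnk, keep are made up for this sketch, and the data is a cut-down copy of the question's data frame with real NA values.

```r
# Cut-down copy of the question's data (only the columns the rule needs)
x <- data.frame(
  id        = c("A","A","A","A","A","A","A","B","B","B","B"),
  StartDate = c("09/07/2006","09/07/2006","09/07/2006","08/10/2006","08/10/2006",
                "09/04/2007","02/03/2011","05/05/2005","08/06/2009","07/09/2009","07/09/2009"),
  EndDate   = c("06/08/2006","06/08/2006","06/08/2006","19/11/2006","19/11/2006",
                "07/05/2007","30/03/2011","02/06/2005","06/07/2009","05/10/2009","05/10/2009"),
  TestDate  = c("09/06/2006","08/09/2006","08/10/2006","08/09/2006","08/10/2006",
                NA,"02/03/2011",NA,"07/09/2009","07/09/2009","08/10/2009"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse dd/mm/yyyy strings
to_date <- function(s) as.Date(s, format = "%d/%m/%Y")
start <- to_date(x$StartDate)
end   <- to_date(x$EndDate)
test  <- to_date(x$TestDate)

# Distance in days from TestDate to the nearest of StartDate / EndDate
dist <- pmin(abs(as.numeric(test - start)), abs(as.numeric(test - end)))

# Rank rows by distance within each id/TestDate pair (ties: first row wins)
key <- paste(x$id, x$TestDate)
rnk <- ave(dist, key, FUN = function(d) rank(d, ties.method = "first"))

# Keep the closest row per pair; rows without a TestDate are always kept
keep   <- is.na(test) | rnk == 1
xfinal <- x[keep, ]
xfinal
```

On the sample data this keeps rows 1, 4, 5, 6, 7, 8, 10 and 11, which matches the xfinal shown in the question. Unlike a duplicated()-based approach, it keeps every row with a missing TestDate, even if one id has several of them.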
