
Identify and remove duplicates by a criterion in R

Hi, I'm confused about handling duplicates in R. I've looked around and can't seem to find anything that helps. I have a dataset like this:

x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
                StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006", 
                              "08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
                EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
                            "02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
                Group = c(1,1,1,2,2,3,4,2,3,4,4),
                TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
                              "NA", "07/09/2009", "07/09/2009", "08/10/2009"),
                Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
              )

> x
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
2   A 09/07/2006 06/08/2006     1 08/09/2006    4
3   A 09/07/2006 06/08/2006     1 08/10/2006 4858
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
9   B 08/06/2009 06/07/2009     3 07/09/2009  795
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

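Basically, I want to identify duplicates in the TestDate variable within each id. For example, the dates 08/09/2006 and 08/10/2006 are repeated for the same person but in different groups, and I don't want the same TestDate to appear in different groups for one id. To choose which TestDate to keep, I take the difference in days between the TestDate and the StartDate and EndDate of each group, and keep the row with the smallest difference. For example, for the date 08/10/2006 I want to keep row 5, because the TestDate there is closer to the StartDate than it is in row 3. In the end, I'd like to get a dataset like this: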

> xfinal
   id  StartDate    EndDate Group   TestDate Code
1   A 09/07/2006 06/08/2006     1 09/06/2006    4
4   A 08/10/2006 19/11/2006     2 08/09/2006    4
5   A 08/10/2006 19/11/2006     2 08/10/2006 4858
6   A 09/04/2007 07/05/2007     3         NA   NA
7   A 02/03/2011 30/03/2011     4 02/03/2011    4
8   B 05/05/2005 02/06/2005     2         NA   NA
10  B 07/09/2009 05/10/2009     4 07/09/2009  795
11  B 07/09/2009 05/10/2009     4 08/10/2009    4

Any help would be greatly appreciated. Thanks.

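# Parse the dd/mm/yyyy strings; the literal "NA" entries become NA dates
x$StartDate <- as.Date(x$StartDate, format = "%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate, format = "%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate, format = "%d/%m/%Y")
# Days between TestDate and StartDate, the criterion described in the question.
# units must be named: the third positional argument of difftime() is tz.
x$Diff <- abs(difftime(x$TestDate, x$StartDate, units = "days"))

# Within each id, put the rows with the smallest difference first
x <- x[order(x$id, x$Diff), ]

# Keep only the first (closest) row for each id/TestDate pair
x <- x[!duplicated(x[, c("id", "TestDate")]), ]
x$Diff <- NULL
x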
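
For reference, here is a self-contained base R sketch of the same idea that can be run from a fresh session. It assumes the intended criterion is the absolute number of days between TestDate and StartDate, as in the asker's row 5 versus row 3 example; distance to EndDate is not considered. The names `to_date`, `Gap`, `has_test`, `tested` and the result name `xfinal` are introduced here for illustration only. Rows with a missing TestDate are set aside so that they are never treated as duplicates of each other, and the original row order is restored at the end.

```r
# Rebuild the data from the question
x <- data.frame(
  id = c("A","A","A","A","A","A","A","B","B","B","B"),
  StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
                "08/10/2006", "09/04/2007", "02/03/2011", "05/05/2005",
                "08/06/2009", "07/09/2009", "07/09/2009"),
  EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006",
              "19/11/2006", "07/05/2007", "30/03/2011", "02/06/2005",
              "06/07/2009", "05/10/2009", "05/10/2009"),
  Group = c(1,1,1,2,2,3,4,2,3,4,4),
  TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006",
               "08/10/2006", "NA", "02/03/2011", "NA", "07/09/2009",
               "07/09/2009", "08/10/2009"),
  Code = c(4,4,4858,4,4858,NA,4,NA,795,795,4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse dd/mm/yyyy; the string "NA" fails to parse and becomes NA
to_date <- function(s) as.Date(s, format = "%d/%m/%Y")
x$StartDate <- to_date(x$StartDate)
x$EndDate   <- to_date(x$EndDate)
x$TestDate  <- to_date(x$TestDate)

# Assumed criterion: absolute gap in days between TestDate and StartDate
x$Gap <- as.numeric(abs(difftime(x$TestDate, x$StartDate, units = "days")))

# Deduplicate only the rows that actually have a TestDate
has_test <- !is.na(x$TestDate)
tested <- x[has_test, ]
tested <- tested[order(tested$id, tested$TestDate, tested$Gap), ]
tested <- tested[!duplicated(tested[, c("id", "TestDate")]), ]

# Put the untested rows back and restore the original row order
xfinal <- rbind(tested, x[!has_test, ])
xfinal <- xfinal[order(as.integer(rownames(xfinal))), ]
xfinal$Gap <- NULL
xfinal  # rows 1, 4, 5, 6, 7, 8, 10 and 11 remain
```

If two rows for the same id and TestDate have exactly the same gap, this sketch keeps whichever comes first in the original data; a different tie-break would need an extra column in the `order()` call.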

