简体   繁体   中英

Select last observation of a date variable - SPSS or R

I'm relatively new to R, so I realise this type of question is asked often but I've read a lot of stack overflow posts and still can't quite get something to work on my data.

I have data on spss, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.

My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.

As it looks on spss:

IDC neighborhood start_date
1 22 08.07.2001
1 44 04.02.2005
1 13 21.06.2010
2 44 24.12.2014
2 3 06.03.2002
3 22 04.01.2006
4 13 08.07.2001
4 2 15.06.2011

In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...

Here are all my attempts so far to select the last date. All attempts ran, there was no error. The output just didn't give me what I want. To my perception, none of these made any change in the number of repetitions of IDC, so none of them actually selected *only the last date.

##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),]  # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)

###### attempt 2--- not working 
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
  temp<-subset(d1, IDC==myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp<-temp[dim(temp)[1],]
  }
  else {
    last.temp<-temp
  }
  last<-rbind(last, last.temp)
}

last

##### atempt 3 -- doesn't work

do.call("rbind",
        by(d1,INDICES = d1$IDC,
      FUN=function(DF)
      DF[which.max(DF$start_date),]))

#### attempt 4 -- doesnt work 
library(plyr)
ddply(d1,.(IDC), function(X)
  X[which.max(X$start_date),])

### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1,d2, IDC)

My goal dataset d1 would look like this:

IDC neighborhood start_date
1 13 21.06.2010
2 44 24.12.2014
3 22 04.01.2006
4 2 15.06.2011

I'm grateful for any help, many thanks <3

There are some problems with most approaches dealing with this data: because your dates are arbitrary strings in a format that does not sort correctly, it just-so-happens to work here because the maximum day-of-month also happens in the maximum year.

It would generally be better to work with that column as a Date object in R, so that comparisons can be better.

dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
#   IDC neighborhood start_date
# 1   1           22 2001-07-08
# 2   1           44 2005-02-04
# 3   1           13 2010-06-21
# 4   2           44 2014-12-24
# 5   2            3 2002-03-06
# 6   3           22 2006-01-04
# 7   4           13 2001-07-08
# 8   4            2 2011-06-15

From here, things are a bit simpler:

Base R

do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
#   IDC neighborhood start_date
# 1   1           13 2010-06-21
# 2   2           44 2014-12-24
# 3   3           22 2006-01-04
# 4   4            2 2011-06-15

dplyr

dat %>%
  group_by(IDC) %>%
  slice(which.max(start_date)) %>%
  ungroup()
# # A tibble: 4 x 3
#     IDC neighborhood start_date
#   <int>        <int> <date>    
# 1     1           13 2010-06-21
# 2     2           44 2014-12-24
# 3     3           22 2006-01-04
# 4     4            2 2011-06-15

Data

dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM