Using gsub to handle two-digit-year dates in R

I am dealing with a very large data set of university students in which dates are in the form

%d/%m/%y

I need to work out ages.

My data looks something like this as it was pulled from a database:

data <- data.table(DOB=c("12/12/01", "8/05/80", "2/11/99"), 
                  started =c("5/10/10", "4/01/12", "27/08/11"))

The problem is that for calculating ages the whole year is not specified.

I have tried changing the years to numeric:

data$DOB <- as.Date(data$DOB, "%d/%m/%y")
data$started <- as.Date(data$started, "%d/%m/%y")
data$DOB <- as.numeric(format(data$DOB, "%Y"))
data$started <- as.numeric(format(data$started, "%Y"))
data$age <- data$started - data$DOB

Obviously this does not work, as I need to add the '20' or '19' century prefix.

Is there a way I can use gsub to put a '20' in front of every two-digit year that is less than or equal to 15, and a '19' in front of every two-digit year that is greater than 15?

I don't think there are any 85 year olds in my dataset.

data<-data.frame(DOB=c('12/12/01', '8/05/80', '2/11/99'), 
                 started =c('5/10/10', '4/01/12', '27/08/11'))

library(stringr)
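toFourYear <- function(x){
  # split each date into its day, month and year parts
  x <- str_split(x, "/")
  x <- lapply(x,
         function(t){
           # years up to and including 15 belong to the 2000s, the rest to the 1900s
           t[3] <- if (as.numeric(t[3]) <= 15) paste0("20", t[3]) else paste0("19", t[3])
           t
         })
  # glue the parts back together with "/"
  x <- vapply(x, paste0, character(1), collapse = "/")
  x
}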

data$DOB <- toFourYear(data$DOB)
data$started <- toFourYear(data$started)

Will this work for you?
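
Since the question asks about gsub specifically, here is a minimal sketch of the same idea using only base R regular expressions (sub, a sibling of gsub that replaces the first match). The helper name add_century and its cutoff argument are hypothetical, introduced here for illustration; it assumes every date ends in a two-digit year after the last "/".

```r
# Hypothetical helper (not from the answers): expand a trailing two-digit
# year to four digits, using a cutoff to choose the century.
add_century <- function(x, cutoff = 15) {
  yy     <- sub(".*/", "", x)            # text after the last "/", e.g. "01"
  prefix <- sub("[0-9]{2}$", "", x)      # everything before the year, e.g. "12/12/"
  paste0(prefix, ifelse(as.integer(yy) <= cutoff, "20", "19"), yy)
}

add_century(c("12/12/01", "8/05/80", "2/11/99"))
# [1] "12/12/2001" "8/05/1980"  "2/11/1999"
```

Because sub and ifelse are vectorized, the helper can be applied to a whole column at once without a loop.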

A similar approach uses the substr and nchar functions of base R.

library(data.table)

dt <-data.table(DOB=c("12/12/01", "8/05/80", "2/11/99"), 
                started =c("5/10/10", "4/01/12", "27/08/11"))

dt

#         DOB  started
# 1: 12/12/01  5/10/10
# 2:  8/05/80  4/01/12
# 3:  2/11/99 27/08/11


WholeYear = function(x){
  v1 = substr(x, 1, nchar(x)-2)          # everything before the 2-digit year
  v2 = substr(x, nchar(x)-1, nchar(x))   # the 2-digit year itself

  ifelse(as.numeric(v2) <= 15, paste0(v1,"20",v2), paste0(v1,"19",v2))
}


# substr, nchar and ifelse are vectorized, so no sapply is needed
dt$DOB = WholeYear(dt$DOB)
dt$started = WholeYear(dt$started)

dt


#           DOB    started
# 1: 12/12/2001  5/10/2010
# 2:  8/05/1980  4/01/2012
# 3:  2/11/1999 27/08/2011
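
One detail worth knowing when parsing two-digit years with as.Date(): according to the R documentation for strptime, on input %y values 00 to 68 are prefixed by 20 and values 69 to 99 by 19. The short check below illustrates that fixed pivot; the two sample dates are made up for the demonstration.

```r
# Made-up sample dates on either side of R's built-in %y pivot
x <- as.Date(c("1/01/68", "1/01/69"), "%d/%m/%y")
format(x, "%Y")
# [1] "2068" "1969"
```

So as.Date() alone handles the sample data in the question correctly, but any two-digit year between 16 and 68 would land in the 2000s unless it is adjusted.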

Or, avoiding additional packages and using vectorized date operations instead of string manipulation:

dat <- data.table(DOB=c("12/12/01", "8/05/80", "2/11/99"), 
                  started =c("5/10/10", "4/01/12", "27/08/11"))

#' Convert a vector of date strings (with 2-digit years) into dates, taking
#' into account a "cutoff" year to demark when a date belongs in one 
#' century or another.
#'
#' @param d vector of character strings
#' @param format date string format for the 'd'
#' @param cutoff_year 2-digit year where dates in 'd' will be considered
#'        part of one century or another
#' @param output_format date format for the output character vector
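as_date_with_cutoff <- function(d, format="%d/%m/%y", 
                                cutoff_year=15, output_format="%d/%m/%Y") {

  # as.Date() maps %y values 00-68 to 20xx and 69-99 to 19xx
  d <- as.Date(d, format)

  # anything that landed after 20<cutoff_year> belongs in the 1900s instead
  late <- !is.na(d) & as.numeric(format(d, "%Y")) > 2000 + cutoff_year
  d[late] <- as.Date(format(d[late], "19%y-%m-%d"))

  format(d, output_format)

}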

# orig
dat
##         DOB  started
## 1: 12/12/01  5/10/10
## 2:  8/05/80  4/01/12
## 3:  2/11/99 27/08/11

dat$DOB <- as_date_with_cutoff(dat$DOB)
dat$started <- as_date_with_cutoff(dat$started)

# converted
dat
##           DOB    started
## 1: 12/12/2001 05/10/2010
## 2: 08/05/1980 04/01/2012
## 3: 02/11/1999 27/08/2011
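
Once the years have four digits, the ages the question is after can be computed from the two dates. The following is a minimal sketch, assuming age means completed years at the start date; the helper name age_years is hypothetical and not part of any answer above.

```r
dob     <- as.Date(c("12/12/2001", "8/05/1980", "2/11/1999"), "%d/%m/%Y")
started <- as.Date(c("5/10/2010", "4/01/2012", "27/08/2011"), "%d/%m/%Y")

# Hypothetical helper: completed years between two Date vectors
age_years <- function(from, to) {
  years <- as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))
  # subtract one year when the birthday has not yet occurred in the 'to' year
  before_birthday <- format(to, "%m%d") < format(from, "%m%d")
  years - as.integer(before_birthday)
}

age_years(dob, started)
# [1]  8 31 11
```

Comparing the zero-padded "%m%d" strings avoids the off-by-one error that comes from subtracting the year numbers alone.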
