简体   繁体   中英

Convert age entered as 'X Weeks, Y Days, Z hours' in R

I have an age variable containing observations that follow this (inconsistent) format:

3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours

I need to convert each observation to hours using R.

I have used strsplit(vector, ',') to split the variable at each comma.

I am running trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries for each observation. I do not know how to properly index these entries so that I end up with one row for each observation.

I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.

I am also open to any different methods of approaching this problem.

After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:

library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)


parse_string = function(i){
  x = split_vector[[i]]
  data_frame(ID = i) %>% 
    mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
           days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
           weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}

all_parsed = lapply(1:length(split_vector),  parse_string)
all_parsed = rbind_all(all_parsed) %>% 
  mutate(final_hours = hours + days * 24 + weeks * 7 * 24)

Hadleyverse comes to the rescue again:

library(lubridate)
library(stringr)

dat <- readLines(textConnection(" 3   weeks,   2  days,  4 hours
 4 week,  6 days,  12 hours 
3 days, 18 hours
4 day, 3 hours
 7 hours
8  hour"))

sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
  sum(sapply(x, function(y) {
    bits <- str_split(str_trim(y), "[ ]+")[[1]]
    duration(as.numeric(bits[1]), bits[2])
  })) / 3600
})

## [1] 556 828  90  99   7   8

I whacked the data a bit to show it's also somewhat flexible in how it parses things. I rly don't think the second str_trim is absolutely necessary but didn't have cycles to verify.

The exposition is that it trims the original vector then splits it into components (which makes a list of vectors). That list is then iterated over and the individual vector elements are further trimmed and split into # and unit duration. That's passed to lubridate and the value is returned and automatically converted to numeric seconds by the call to sum and we then make it into hours.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM