
Iterate over column names in an R data frame in order to change their type

library(lubridate)

# data to build the df
d1 <- c("1/2/14", "3/5/15", "1/13/11") #start
d2 <- c("1/2/15", "4/5/15", "6/18/15") #stop
d3 <- c("5/16/08", "1/7/07", "6/22/01") #start
d4 <- c("11/29/12", "8/5/14", "1/13/12") #stop
a <- c("Blah", "Blah", "Blah")
b <- c("Blah", "Blah", "Blah")
c <- c("Blah", "Blah", "Blah")
f <- c("Blah", "Blah", "Blah")
colNames <- c("Col.a", "Col.b", "Col.c", "Project1.start", "Project1.end", "Project2.start", "Project2.end", "Col.f")

# assemble the df
df <- data.frame(a,b,c,d1,d2,d3,d4,f)
names(df) <- colNames

# change the character columns for dX into Date objects (mdy() returns Date,
# not POSIXct) so they play nicely with lubridate
df$Project1.start <- mdy(df$Project1.start)
df$Project1.end <- mdy(df$Project1.end)
df$Project2.start <- mdy(df$Project2.start)
df$Project2.end <- mdy(df$Project2.end)

BUT! I want to apply mdy iteratively over whichever dX columns I specify. Imagine that instead of d1-d4 I have d1-d142. There must be an elegant, i.e. non-brute-force, way of doing this!

So, I tried this. I know that I'm running mdy on too many columns, but I am just trying to make it work at all. I've also tried for loops with seq(), etc., but I know that I'm missing the vector-based approach that R expects.

f <- function(x) {x <- mdy(x)}
newdf <- apply(df,2,f)

but it throws

Warning messages:
1: All formats failed to parse. No formats found. 
...
10: All formats failed to parse. No formats found. 

and the newdf is bad:

     Col.a Col.b Col.c Project1.start Project1.end Project2.start Project2.end Col.f
[1,]    NA    NA    NA             NA           NA             NA           NA    NA
[2,]    NA    NA    NA             NA           NA             NA           NA    NA
[3,]    NA    NA    NA             NA           NA             NA           NA    NA

       Project1.duration Project2.duration
[1,]                NA                NA
[2,]                NA                NA
[3,]                NA                NA

What am I doing that is just so st00pid?

So, once that is done, we want to do some date math

df$Project1.duration <- (df$Project1.end - df$Project1.start )
df$Project2.duration <- (df$Project2.end - df$Project2.start )

Same here: I want to iterate over the durations for all the dX columns, but perhaps I need to reshape the data to make this happen. How would you take this large number of durations, for all of these separately coded projects, and reassemble them into a df so that I can plot the durations for each project? In my sample df each project has three durations (rows 1:3), and I want to be able to compare the rows for each project.

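Your error is because apply runs mdy on every column of df, not just the "ProjectX.{start,end}" ones. apply also converts the data frame to a matrix first, so if the date columns were already converted as in your first block, they reach mdy as text like "2014-01-02", which it cannot parse as month-day-year. And if you loop over the columns yourself, note that df[col] is a data.frame while mdy needs a vector, so use df[[col]].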

For example:

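# pick only the ProjectN.start / ProjectN.end columns (not e.g. ProjectN.duration)
cols <- grep('^Project[0-9]+\\.(start|end)$', names(df))
# do a one-liner like this
df[cols] <- lapply(df[cols], mdy)
# or a loop like this if you want (use one or the other, not both)
for (col in cols) {
    df[[col]] <- mdy(df[[col]])
}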
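To see why the apply attempt and df[col] behave differently from df[[col]], here is a small base-R sketch. The data frame `demo` and its columns are hypothetical stand-ins, not the df from the question, and lubridate is deliberately left out so the snippet runs on its own.

```r
# Hypothetical stand-in data: one text column and one column that is already a Date.
demo <- data.frame(
  label = c("Blah", "Blah"),
  start = as.Date(c("2014-01-02", "2015-03-05")),
  stringsAsFactors = FALSE
)

# apply() works on a matrix, so the data frame goes through as.matrix() first.
# Mixing text and dates yields a character matrix, with the dates turned into text.
m <- as.matrix(demo)
is_char_matrix <- is.character(m)
start_as_text <- m[, "start"]

# Single brackets keep the data.frame wrapper; double brackets extract the vector.
single_class <- class(demo["start"])
double_class <- class(demo[["start"]])

print(is_char_matrix)
print(start_as_text)
print(single_class)
print(double_class)
```

This is why lapply over df[cols], or a loop using df[[col]], is the safer route: each column is handed over as a vector with its own type intact.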

With regard to calculating per-project data (like duration), you can kludge it like this:

projects <- paste0('Project', 1:2) # however many projects
df[paste0(projects, '.duration')] <- df[paste0(projects, '.end')] - df[paste0(projects, '.start')]
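With 142 projects you may not want to hard-code 1:142. A possible variation, sketched below, derives the project prefixes from the column names and stores each duration as a plain number of days. It assumes the columns follow the ProjectN.start / ProjectN.end naming from the question and have already been converted to Date; `df_demo` is a hypothetical stand-in built with base R's as.Date.

```r
# Hypothetical stand-in for the already-converted df.
df_demo <- data.frame(
  Col.a = c("Blah", "Blah"),
  Project1.start = as.Date(c("2014-01-02", "2015-03-05")),
  Project1.end   = as.Date(c("2015-01-02", "2015-04-05")),
  Project2.start = as.Date(c("2008-05-16", "2007-01-07")),
  Project2.end   = as.Date(c("2012-11-29", "2014-08-05")),
  stringsAsFactors = FALSE
)

# Find the project prefixes from the ".start" columns instead of typing them out.
start_cols <- grep("^Project[0-9]+\\.start$", names(df_demo), value = TRUE)
projects <- sub("\\.start$", "", start_cols)

# Add one duration column per project, as a numeric count of days.
for (p in projects) {
  span <- df_demo[[paste0(p, ".end")]] - df_demo[[paste0(p, ".start")]]
  df_demo[[paste0(p, ".duration")]] <- as.numeric(span, units = "days")
}

print(df_demo[paste0(projects, ".duration")])
```

Storing plain numbers rather than difftime objects is a design choice here; it tends to make plotting simpler, at the cost of losing the unit label.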

However, in the long run (particularly if you have lots of projects or want to calculate lots of stats per project, not just duration) you might consider keeping your data in long format, i.e.

Project  start  end duration
 1       ...
 1
 1
 2
 2
 2

(probably with some sort of ID variable so you know which project 2 went with which project 1)

Then you can easily do mydf$duration <- mydf$end - mydf$start, and if you want it in wide format again you can make use of reshape.
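As a rough sketch of the wide-to-long step, base R's reshape can stack the start and end columns for two projects. The `id` column is a hypothetical row identifier added for illustration, and the dates are the question's sample values entered directly with as.Date so the snippet does not depend on lubridate.

```r
# Wide data: one row per id, one start/end pair per project.
wide <- data.frame(
  id = 1:3,
  Project1.start = as.Date(c("2014-01-02", "2015-03-05", "2011-01-13")),
  Project1.end   = as.Date(c("2015-01-02", "2015-04-05", "2015-06-18")),
  Project2.start = as.Date(c("2008-05-16", "2007-01-07", "2001-06-22")),
  Project2.end   = as.Date(c("2012-11-29", "2014-08-05", "2012-01-13"))
)

# Stack the columns: each list element in `varying` holds the columns that
# feed one long-format variable, ordered to match `times`.
long <- reshape(
  wide,
  direction = "long",
  idvar     = "id",
  varying   = list(c("Project1.start", "Project2.start"),
                   c("Project1.end", "Project2.end")),
  v.names   = c("start", "end"),
  timevar   = "Project",
  times     = 1:2
)

# One duration calculation now covers every project.
long$duration <- long$end - long$start

print(long)
```

With the data in this shape, a per-project comparison plot only needs the Project and duration columns.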

