I wonder if there is a way to apply a function to each row of a data.frame such that the column classes are preserved? Let's look at an example to clarify what I mean:
test <- data.frame(startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
enddate = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
nEvents = c(123, 456, 789))
Suppose I would like to expand the data.frame test
by inserting all days between startdate
and enddate
and distribute the number of events over those days. My first try to do so was this:
eventsPerDay1 <- function(row) {
n_days <- as.numeric(row$enddate - row$startdate) + 1
data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}
apply(test, 1, eventsPerDay1)
This, however, is not possible because apply
calls as.matrix
on test
and thus it gets converted to a character matrix and all column classes are lost.
I already found two workarounds which you can find below, so my question is more of a philosphical nature.
library(magrittr)
############# Workaround 1
eventsPerDay2 <- function(startdate, enddate, nEvents) {
n_days <- as.numeric(enddate - startdate) + 1
data.frame(date = seq(startdate, enddate, by = "1 day"),
nEvents = rmultinom(1, nEvents, rep(1/n_days, n_days)))
}
mapply(eventsPerDay2, test$startdate, test$enddate, test$nEvents, SIMPLIFY = F) %>%
do.call(rbind, .)
############# Workaround 2
seq_along(test) %>%
lapply(function(i) test[i, ]) %>%
lapply(eventsPerDay1) %>%
do.call(rbind, .)
My "problem" with the workarounds is the following:
mapply
. It has a different signature than the other *apply
functions (as the the order of arguments differs) and I always feel that a for
loop would just have been clearer. So does anyone know a function whose call would look like apply(test, 1, eventsPerDay1)
and that will work?
We can do this with data.table
library(data.table)
res <- setDT(test)[,n_days := as.numeric(enddate - startdate) + 1
][, .(date = seq(startdate, enddate, by= "1 day"),
nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))),
by = 1:nrow(test)][, nrow := NULL]
str(res)
#Classes ‘data.table’ and 'data.frame': 152 obs. of 2 variables:
# $ date : Date, format: "2010-03-07" "2010-03-08" "2010-03-09" "2010-03-10" ...
# $ nEvents: int 5 9 7 11 6 6 10 7 12 3 ...
The above can be wrapped in a function
eventsPerDay <- function(dat){
as.data.table(dat)[, n_days:= as.numeric(enddate - startdate) + 1
][, .(date = seq(startdate, enddate, by= "1 day"),
nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))) , 1:nrow(dat)
][, nrow := NULL][]
}
eventsPerDay(test)
Another idea:
library(dplyr)
library(tidyr)
test %>%
mutate(id = row_number()) %>%
group_by(startdate) %>%
complete(startdate = seq.Date(startdate, enddate, 1), nesting(id)) %>%
group_by(id) %>%
mutate(nEvents = rmultinom(1, first(nEvents), rep(1/n(), n()))) %>%
select(startdate, nEvents)
Which gives:
#Source: local data frame [152 x 3]
#Groups: id [3]
#
# id startdate nEvents
# <int> <date> <int>
#1 1 2010-03-07 6
#2 1 2010-03-08 6
#3 1 2010-03-09 6
#4 1 2010-03-10 7
#5 1 2010-03-11 12
#6 1 2010-03-12 5
#7 1 2010-03-13 8
#8 1 2010-03-14 5
#9 1 2010-03-15 5
#10 1 2010-03-16 9
## ... with 142 more rows
I have asked myself the same question.
I either end up splitting the df into a list (the base way)
xy <- data.frame()
xy.list <- split(xy, 1:nrow(xy))
out <- lapply(xy.list, function(x) ...)
answer <- unlist(out)
or try the hadleyverse dplyr way using rowwise (the blackbox way)
xy %>%
rowwise() %>%
mutate(newcol = function(x) ....)
I agree that their should be a base implementation of apply(xy, 1, function(x)) that doesn't coerce into character, but I imagine the R ancients implemented the matrix conversion for an advanced reason my primitive mind can't understand.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.