
Apply function to each row of data.frame and preserve column classes

I wonder if there is a way to apply a function to each row of a data.frame such that the column classes are preserved. Let's look at an example to clarify what I mean:

test <- data.frame(startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
                   enddate = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
                   nEvents = c(123, 456, 789))

Suppose I would like to expand the data.frame test by inserting all days between startdate and enddate and distributing the number of events over those days. My first attempt was this:

eventsPerDay1 <- function(row) {
    n_days <- as.numeric(row$enddate - row$startdate) + 1
    data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
               nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}

apply(test, 1, eventsPerDay1)

This, however, is not possible because apply calls as.matrix on test, so it gets converted to a character matrix and all column classes are lost.
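The coercion is easy to check directly. A minimal sketch, reusing the test data from above (the variable names col_classes, m, matrix_type and row_classes are only for illustration):

```r
test <- data.frame(
    startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
    enddate   = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
    nEvents   = c(123, 456, 789)
)

# Column classes of the data.frame itself: two Date columns, one numeric
col_classes <- sapply(test, class)

# apply() coerces its input with as.matrix(); a matrix holds a single type,
# and mixing Date and numeric columns yields a character matrix
m <- as.matrix(test)
matrix_type <- typeof(m)

# Each "row" handed to the function is therefore a character vector
row_classes <- apply(test, 1, class)

print(col_classes)
print(matrix_type)
print(row_classes)
```

Because each row arrives as a plain character vector rather than a one-row data.frame, row$enddate inside eventsPerDay1 fails with "$ operator is invalid for atomic vectors".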

I already found two workarounds, which you can find below, so my question is more of a philosophical nature.

library(magrittr)
############# Workaround 1
eventsPerDay2 <- function(startdate, enddate, nEvents) {
    n_days <- as.numeric(enddate - startdate) + 1
    data.frame(date = seq(startdate, enddate, by = "1 day"),
               nEvents = rmultinom(1, nEvents, rep(1/n_days, n_days)))
}

mapply(eventsPerDay2, test$startdate, test$enddate, test$nEvents, SIMPLIFY = FALSE) %>%
    do.call(rbind, .)


############# Workaround 2
seq_len(nrow(test)) %>%   # iterate over rows; seq_along(test) would iterate over columns
    lapply(function(i) test[i, ]) %>%
    lapply(eventsPerDay1) %>%
    do.call(rbind, .)

My "problem" with the workarounds is the following:

  • Workaround 1: It may not be the best reason, but I simply do not like mapply. It has a different signature than the other *apply functions (the order of arguments differs), and I always feel that a for loop would have been clearer.
  • Workaround 2: While it is very flexible, I think it is not clear at first sight what is happening.
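For comparison, the for loop mentioned under Workaround 1 could look like this. It is a minimal sketch that reuses eventsPerDay2 and the test data from above; the names pieces and result are only illustrative.

```r
test <- data.frame(
    startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
    enddate   = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
    nEvents   = c(123, 456, 789)
)

eventsPerDay2 <- function(startdate, enddate, nEvents) {
    n_days <- as.numeric(enddate - startdate) + 1
    data.frame(date = seq(startdate, enddate, by = "1 day"),
               nEvents = rmultinom(1, nEvents, rep(1/n_days, n_days)))
}

# Pre-allocate one list slot per row, then fill it row by row.
# Indexing the columns with [i] keeps their classes (Date stays Date).
pieces <- vector("list", nrow(test))
for (i in seq_len(nrow(test))) {
    pieces[[i]] <- eventsPerDay2(test$startdate[i],
                                 test$enddate[i],
                                 test$nEvents[i])
}

# Stack the per-row results into one data.frame
result <- do.call(rbind, pieces)
str(result)
```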

So does anyone know a function whose call would look like apply(test, 1, eventsPerDay1) and that will work?

We can do this with data.table:

library(data.table)
res <- setDT(test)[, n_days := as.numeric(enddate - startdate) + 1
         ][, .(date = seq(startdate, enddate, by = "1 day"),
               nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))),
           by = 1:nrow(test)
         ][, nrow := NULL]
str(res)
#Classes ‘data.table’ and 'data.frame':  152 obs. of  2 variables:
# $ date   : Date, format: "2010-03-07" "2010-03-08" "2010-03-09" "2010-03-10" ...
# $ nEvents: int  5 9 7 11 6 6 10 7 12 3 ...

The above can be wrapped in a function:

eventsPerDay <- function(dat) {
    as.data.table(dat)[, n_days := as.numeric(enddate - startdate) + 1
      ][, .(date = seq(startdate, enddate, by = "1 day"),
            nEvents = c(rmultinom(1, nEvents, rep(1/n_days, n_days)))),
        1:nrow(dat)
      ][, nrow := NULL][]
}

eventsPerDay(test)

Another idea:

library(dplyr)
library(tidyr)

test %>%
  mutate(id = row_number()) %>%
  group_by(startdate) %>%
  complete(startdate = seq.Date(startdate, enddate, 1), nesting(id)) %>%
  group_by(id) %>%
  mutate(nEvents = rmultinom(1, first(nEvents), rep(1/n(), n()))) %>%
  select(startdate, nEvents)

Which gives:

#Source: local data frame [152 x 3]
#Groups: id [3]
#
#      id  startdate nEvents
#   <int>     <date>   <int>
#1      1 2010-03-07       6
#2      1 2010-03-08       6
#3      1 2010-03-09       6
#4      1 2010-03-10       7
#5      1 2010-03-11      12
#6      1 2010-03-12       5
#7      1 2010-03-13       8
#8      1 2010-03-14       5
#9      1 2010-03-15       5
#10     1 2010-03-16       9
## ... with 142 more rows

I have asked myself the same question.

I either end up splitting the df into a list (the base way)

xy <- data.frame()
xy.list <- split(xy, 1:nrow(xy))
out <- lapply(xy.list, function(x) ...)
answer <- unlist(out)
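As a concrete sketch of the split approach, here it is applied to the test data and eventsPerDay1 from the question. The names rows, out and answer are illustrative, and because each piece is a data.frame, do.call(rbind, ...) is used to combine them instead of unlist():

```r
test <- data.frame(
    startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
    enddate   = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
    nEvents   = c(123, 456, 789)
)

eventsPerDay1 <- function(row) {
    n_days <- as.numeric(row$enddate - row$startdate) + 1
    data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
               nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}

# split() on a data.frame returns a list of one-row data.frames,
# so every element keeps the original column classes
rows <- split(test, seq_len(nrow(test)))

# Apply the row function to each one-row data.frame
out <- lapply(rows, eventsPerDay1)

# Combine the per-row results and drop the composite row names
answer <- do.call(rbind, out)
rownames(answer) <- NULL
str(answer)
```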

or try the hadleyverse dplyr way using rowwise (the blackbox way)

xy %>%
rowwise() %>%
mutate(newcol = function(x) ....)

I agree that there should be a base implementation of apply(xy, 1, function(x)) that doesn't coerce to character, but I imagine the R ancients implemented the matrix conversion for an advanced reason my primitive mind can't understand.
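In the meantime, a small wrapper gets close to the desired call. This is only a sketch: row_apply is a hypothetical helper name, not a base R function, and it assumes the row function accepts a one-row data.frame, as eventsPerDay1 from the question does.

```r
# Hypothetical helper (not part of base R): calls fun once per row,
# passing a one-row data.frame so that column classes survive.
row_apply <- function(df, fun, ...) {
    lapply(seq_len(nrow(df)), function(i) fun(df[i, , drop = FALSE], ...))
}

test <- data.frame(
    startdate = as.Date(c("2010-03-07", "2013-09-13", "2011-11-12")),
    enddate   = as.Date(c("2010-03-23", "2013-12-01", "2012-01-05")),
    nEvents   = c(123, 456, 789)
)

eventsPerDay1 <- function(row) {
    n_days <- as.numeric(row$enddate - row$startdate) + 1
    data.frame(date = seq(row$startdate, row$enddate, by = "1 day"),
               nEvents = rmultinom(1, row$nEvents, rep(1/n_days, n_days)))
}

# The classes seen inside the row function match the original columns
classes <- row_apply(test, function(row) sapply(row, class))

# The call mirrors apply(test, 1, eventsPerDay1); the result is a list
# of data.frames, which rbind stacks into one
result <- do.call(rbind, row_apply(test, eventsPerDay1))
str(result)
```

Unlike apply, this always returns a list and never simplifies to a vector or matrix, which keeps the behaviour predictable at the cost of an explicit combining step.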
