
Building an efficient for loop for a user-defined function: data.table

I'm trying to build an efficient for loop for this function proposed by minem here: (Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table)

My data are:

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date(lsr$eksd)
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

The function proposed by minem is:

by_minem2 <- function(dt = lsr2) {
  d <- as.numeric(as.Date("2013-02-01"))
  dt[, ENDDATE2 := as.numeric(ENDDATE)]
  # total remaining days (ENDDATE2 - d) per ID for rows active at date d
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  # add the IDs with no rows matching the condition, with a value of 0
  uid <- unique(dt$ID)
  id2 <- uid[!(uid %in% x$ID)]
  x2 <- data.table(ID = id2, V1 = 0)
  x <- rbind(x, x2)
  setkey(x, ID)
  x
}

This returns:

> by_minem2(lsr)
   ID V1
1:  1 64
2:  2  0
3:  3 63

For the loop I need to include information about which time I evaluated at, so the ideal repeated output looks like this:

cbind(as.Date("2013-02-01"),by_minem2(lsr))

I then want to repeat this for different dates a few hundred times, putting everything into the same data.table:

time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at

I'm trying to do this with a for loop like this:

for (d in min(time.months):max(time.months)) {
  by_minem <- function(dt = lsr2) {
    d <- as.numeric(d)
    dt[, ENDDATE2 := as.numeric(ENDDATE)]
    x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
    uid <- unique(dt$ID)
    id2 <- setdiff(uid, x$ID)
    id2 <- uid[!(uid %in% x$ID)]
    x2 <- data.table(ID = id2, V1 = 0)
    x <- rbind(x, x2)
    setkey(x, ID)
    xtot <- append(xtot,x) 
    xtot <- cbind(d, xtot) # i need to know time of evaluation
    xtot
  }
}

Something like this:

dt <- lsr
dt[, ENDDATE2 := as.numeric(ENDDATE)]
s <- time.months
xtot <- lapply(s, function(d) {
  d <- as.numeric(d)
  x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
  uid <- unique(dt$ID)
  id2 <- setdiff(uid, x$ID)
  id2 <- uid[!(uid %in% x$ID)]
  if (length(id2) > 0) {
    x2 <- data.table(ID = id2, V1 = 0)
    x <- rbind(x, x2)
  }
  setkey(x, ID)
  x
})
for (x in seq_along(xtot)) {
  setnames(xtot[[x]], c("ID", paste0("V", x)))
}

xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
xtot
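
As a side note (a sketch, not part of the original answer): if a long format with one row per ID and evaluation date is preferred over the wide merge above, the list returned by lapply() can also be stacked with rbindlist(), using the idcol to recover the evaluation date. This would replace the renaming loop and the merge:

# sketch: stack the per-date results (before any renaming) into long format
long <- rbindlist(xtot, idcol = "i")   # i = index into the date vector s
long[, date := s[i]][, i := NULL]      # map the index back to the evaluation date
setcolorder(long, c("ID", "date", "V1"))
long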

As indicated in the answer to the related question Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table, this can be solved by updating in a non-equi join, which is possible with data.table.

The difference to the linked question is that here we need to create the cross join CJ() of all unique IDs with the vector of dates on our own before joining with lsr.
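
As a quick illustration (added here, not part of the original answer), CJ() simply builds all combinations of its arguments, i.e., every unique ID paired with every date:

CJ(ID = c("1", "2"), date = as.IDate(c("2013-02-01", "2013-03-01")), unique = TRUE)
#    ID       date
# 1:  1 2013-02-01
# 2:  1 2013-03-01
# 3:  2 2013-02-01
# 4:  2 2013-03-01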

The OP has provided a series of dates, time.months, whose definition

time.months <- as.Date("2013-02-01")+(365.25/12)*(0:192) #dates to evaluate at

leads to "crooked" dates, which only becomes visible when they are coerced to numeric or POSIXct:

head(lubridate::as_datetime(time.months))
 [1] "2013-02-01 00:00:00 UTC" "2013-03-03 10:30:00 UTC" "2013-04-02 21:00:00 UTC" [4] "2013-05-03 07:30:00 UTC" "2013-06-02 18:00:00 UTC" "2013-07-03 04:30:00 UTC" 

The issue is that these "dates" are not aligned with midnight but start somewhere during the day. To avoid these ambiguities, the seq() function can be used:

dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")

which creates a series of dates starting on the first day of each month.
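
A quick check (added for illustration) confirms that these dates are aligned with midnight:

head(lubridate::as_datetime(dates), 3)
# [1] "2013-02-01 UTC" "2013-03-01 UTC" "2013-04-01 UTC"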

In addition, data.table's IDate class is used, which stores dates as integers (4 bytes) instead of doubles (8 bytes). This saves memory as well as processing time because the usually faster integer arithmetic can be used.
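
For illustration (added here), the difference in the underlying storage type is easy to check:

storage.mode(as.Date("2013-02-01"))    # "double"  (8 bytes per date)
storage.mode(as.IDate("2013-02-01"))   # "integer" (4 bytes per date)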

# coerce Date to IDate
idates <- as.IDate(dates)
setDT(lsr)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]

# cross join unique IDs with dates 
CJ(ID = lsr$ID, date = idates, unique = TRUE)[
  # initialize result column
  , AH := 0L][
    # non-equi join and ...
    lsr, on = .(ID, date >= eksd, date < ENDDATE), 
    # ... update only matching rows
    AH := as.integer(ENDDATE - x.date)][
      # reshape from long to wide format
      , dcast(.SD, ID ~ date)]
   ID 2013-02-01 2013-03-01 2013-04-01 2013-05-01 2013-06-01 2013-07-01 2013-08-01 [...]
1:  1         64         36          5          0          0          0          0
2:  2          0          0        110         80         49         19          0
3:  3         63         35          4          0          0          0          0
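
If a long format (one row per ID and date) is preferred, the dcast() step can simply be dropped; a sketch of that variant:

# long format variant (sketch): same update join, without the reshape
CJ(ID = lsr$ID, date = idates, unique = TRUE)[
  , AH := 0L][
    lsr, on = .(ID, date >= eksd, date < ENDDATE),
    AH := as.integer(ENDDATE - x.date)][]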

Caveat

Note that the above code assumes that the intervals [eksd, ENDDATE) for each ID do not overlap. This can be verified by

lsr[order(eksd), all(eksd - shift(ENDDATE, fill = 0) > 0), keyby = ID]
   ID   V1
1:  1 TRUE
2:  2 TRUE
3:  3 TRUE

In case there are overlaps, the above code can be modified to aggregate within the non-equi join using by = .EACHI.
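
A possible sketch of that modification (an assumption on my part, not code from the original answer): aggregate the remaining days per ID and date inside the join rather than updating by reference. Note that the join columns of the result take their names from lsr, so the evaluation date comes back in the columns named eksd and ENDDATE:

# sketch: sum overlapping supply periods per ID and evaluation date
res <- lsr[CJ(ID = lsr$ID, date = idates, unique = TRUE),
           on = .(ID, eksd <= date, ENDDATE > date),
           .(AH = sum(as.integer(x.ENDDATE - i.date), na.rm = TRUE)),
           by = .EACHI]
setnames(res, "eksd", "date")   # this join column holds the evaluation date
res[, ENDDATE := NULL]          # drop the duplicated join column
dcast(res, ID ~ date, value.var = "AH")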

Benchmark

In another related question, data.table by = xx How do i keep the groups of length 0 when i returns no match, the OP has pointed out that performance is crucial due to the size of his production data.

According to the OP's comment, lsr has 20 million rows and 12 columns; the adherence dataset, which I'm trying not to use, has 1.5 million rows and 2 columns. In another question, the OP mentions that lsr is a few hundred million rows.

@minem has responded to this by providing a benchmark in his answer. We can use this benchmark data to compare the different answers.

# create benchmark data
lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
n <- 5e4
lsr2 <- lapply(1:n, function(x) lsr)
lsr2 <- rbindlist(lsr2, use.names = T, fill = T, idcol = T)
lsr2[, ID := as.integer(paste0(.id, ID))]

Thus, the benchmark dataset consists of 400 k rows and 150 k unique IDs:

lsr2[, .(.N, uniqueN(ID))]
        N     V2
1: 400000 150000

# pull data preparation out of the benchmark 
lsr2i <- copy(lsr2)[, eksd := as.IDate(eksd)][, ENDDATE := as.IDate(ENDDATE)]
lsr2[, ENDDATE2 := as.numeric(ENDDATE)]

# define date series
dates <- seq(as.Date("2013-02-01"), length.out = 193, by = "month")
idates <- seq(as.IDate("2013-02-01"), length.out = 193, by = "month")

# run benchmark
library(microbenchmark)
bm <- microbenchmark(
  minem = {
    dt <- copy(lsr2)
    xtot <- lapply(dates, function(d) {
      d <- as.numeric(d)
      x <- dt[eksd <= d & ENDDATE > d, sum(ENDDATE2 - d), keyby = ID]
      uid <- unique(dt$ID)
      id2 <- setdiff(uid, x$ID)
      id2 <- uid[!(uid %in% x$ID)]
      if (length(id2) > 0) {
        x2 <- data.table(ID = id2, V1 = 0)
        x <- rbind(x, x2)
      }
      setkey(x, ID)
      x
    })
    for (x in seq_along(xtot)) {
      setnames(xtot[[x]], c("ID", paste0("V", x)))
    }
    xtot <- Reduce(function(...) merge(..., all = TRUE, by = "ID"), xtot)
    xtot
  },
  uwe = {
    dt <- copy(lsr2i)
    CJ(ID = dt$ID, date = idates, unique = TRUE)[, AH := 0L][
      dt, on = .(ID, date >= eksd, date < ENDDATE), 
      AH := as.integer(ENDDATE - x.date)][, dcast(.SD, ID ~ date)]
  },
  times = 1L
)
print(bm)

The result for one run shows that the non-equi join is more than 4 times faster than the lapply() approach.

Unit: seconds
  expr       min        lq      mean    median        uq       max neval
 minem 27.654703 27.654703 27.654703 27.654703 27.654703 27.654703     1
   uwe  5.958907  5.958907  5.958907  5.958907  5.958907  5.958907     1
