
R: using data.table := operations to calculate new columns

Let's take the following data:

library(data.table)

dt <- data.table(TICKER=c(rep("ABC",10),"DEF"),
        PERIOD=c(rep(as.Date("2010-12-31"),10),as.Date("2011-12-31")),
        DATE=as.Date(c("2010-01-05","2010-01-07","2010-01-08","2010-01-09","2010-01-10","2010-01-11","2010-01-13","2010-04-01","2010-04-02","2010-08-03","2011-02-05")),
        ID=c(1,2,1,3,1,2,1,1,2,2,1),VALUE=c(1.5,1.3,1.4,1.6,1.4,1.2,1.5,1.7,1.8,1.7,2.3))
setkey(dt,TICKER,PERIOD,ID,DATE)

Now, for each ticker/period combination, I need the following in new columns:

  • PRIORAVG: The mean of the latest VALUE of each ID, excluding the current ID, provided that value is no more than 180 days old.
  • PREV: The previous value from the same ID.

The result should look like this:

      TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
 [1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
 [2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
 [3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
 [4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
 [5,]    ABC 2010-12-31 2010-04-01  1   1.7     1.40  1.5
 [6,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
 [7,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
 [8,]    ABC 2010-12-31 2010-04-02  2   1.8     1.65  1.2
 [9,]    ABC 2010-12-31 2010-08-03  2   1.7     1.70  1.8
[10,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[11,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

Note that the PRIORAVG on row 9 is 1.7. That is the VALUE on row 5, which is the only observation by another ID in the preceding 180 days.

I have discovered the data.table package, but I can't seem to fully understand the := function. When I keep it simple, it seems to work. To obtain the previous value for each ID (I based this on the solution to this question):

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),roll=TRUE,mult="last"][,VALUE]]

This works great, and it only takes 0.13 seconds to perform this operation over my dataset with ~250k rows; my vector scan function gets identical results but is about 30,000 times slower.

OK, so that covers the PREV requirement. Let's get to the more complex one, PRIORAVG. The fastest method I have found so far uses a couple of vector scans and passes the function through the plyr function adply to get the result for each row.

calc <- function(df,ticker,period,id,date) {
  df <- df[df$TICKER == ticker & df$PERIOD == period 
        & df$ID != id & df$DATE < date & df$DATE > date-180, ]
  df <- df[order(df$DATE),]
  mean(df[!duplicated(df$ID, fromLast = TRUE),"VALUE"])
}

library(plyr)  # provides adply
df <- data.frame(dt)
adply(df,1,function(x) calc(df,x$TICKER,x$PERIOD,x$ID,x$DATE))

I wrote the function for a data.frame and it does not seem to work with a data.table. For a subset of 5000 rows this takes about 44 seconds, but my data consists of more than 1 million rows. I wonder if this can be made more efficient through the use of :=.
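As a sanity check on what calc computes, it can be called directly for single rows. The sketch below rebuilds the sample data as a plain data.frame (base R only, so neither data.table nor plyr is needed) and evaluates three rows; the names p, r_aug, r_jan and r_first are made up for this illustration.

```r
# Same sample data as above, built directly as a data.frame.
df <- data.frame(
  TICKER = c(rep("ABC", 10), "DEF"),
  PERIOD = c(rep(as.Date("2010-12-31"), 10), as.Date("2011-12-31")),
  DATE   = as.Date(c("2010-01-05", "2010-01-07", "2010-01-08", "2010-01-09",
                     "2010-01-10", "2010-01-11", "2010-01-13", "2010-04-01",
                     "2010-04-02", "2010-08-03", "2011-02-05")),
  ID     = c(1, 2, 1, 3, 1, 2, 1, 1, 2, 2, 1),
  VALUE  = c(1.5, 1.3, 1.4, 1.6, 1.4, 1.2, 1.5, 1.7, 1.8, 1.7, 2.3)
)

# The function from the question, unchanged in logic.
calc <- function(df, ticker, period, id, date) {
  # Keep other IDs' observations strictly before `date`, at most 180 days old.
  df <- df[df$TICKER == ticker & df$PERIOD == period
           & df$ID != id & df$DATE < date & df$DATE > date - 180, ]
  df <- df[order(df$DATE), ]
  # Latest observation per remaining ID, then their mean.
  mean(df[!duplicated(df$ID, fromLast = TRUE), "VALUE"])
}

p <- as.Date("2010-12-31")
r_aug   <- calc(df, "ABC", p, 2, as.Date("2010-08-03"))  # only ID 1 on 2010-04-01 qualifies
r_jan   <- calc(df, "ABC", p, 1, as.Date("2010-01-10"))  # IDs 2 and 3 qualify
r_first <- calc(df, "ABC", p, 1, as.Date("2010-01-05"))  # nothing qualifies
```

Here r_aug is 1.7 and r_jan is 1.45, matching the expected PRIORAVG for those rows. r_first comes out as NaN rather than NA, because mean() of an empty vector is NaN.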

dt[J("ABC"),last(VALUE),by=ID][,mean(V1)]

This works to select the average of the latest VALUEs for each ID for ABC.

dt[,PRIORAVG:=dt[J(TICKER,PERIOD),last(VALUE),by=ID][,mean(V1)]]

This, however, does not work as expected, as it takes the average of all last VALUEs for all ticker/periods instead of only for the current ticker/period. So it ends up with all rows getting the same mean value. Am I doing something wrong or is this a limitation of := ?
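The "same value on every row" symptom can be reproduced on a toy table. This is a minimal sketch, not the solution to the PRIORAVG problem: the table toy and the columns ALLAVG and GRPAVG are made up for illustration. It only shows that a length-one right-hand side is recycled by :=, whereas adding by= evaluates the right-hand side once per group.

```r
library(data.table)

# Toy data: two tickers, three rows (illustrative only).
toy <- data.table(TICKER = c("ABC", "ABC", "DEF"),
                  ID     = c(1, 2, 1),
                  VALUE  = c(1.5, 1.3, 2.3))

# The right-hand side is a single number, so := recycles it over all rows.
toy[, ALLAVG := mean(VALUE)]

# With by=, the right-hand side is evaluated separately for each TICKER.
toy[, GRPAVG := mean(VALUE), by = TICKER]
```

After this, ALLAVG holds the overall mean on every row, while GRPAVG differs between ABC and DEF.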

Great question. Try this:

dt
     TICKER     PERIOD       DATE ID VALUE
[1,]    ABC 2010-12-31 2010-01-05  1   1.5
[2,]    ABC 2010-12-31 2010-01-08  1   1.4
[3,]    ABC 2010-12-31 2010-01-10  1   1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5
[5,]    ABC 2010-12-31 2010-01-07  2   1.3
[6,]    ABC 2010-12-31 2010-01-11  2   1.2
[7,]    ABC 2010-12-31 2010-01-09  3   1.6
[8,]    DEF 2011-12-31 2011-02-05  1   2.3

ids = unique(dt$ID)
dt[,PRIORAVG:=NA_real_]
for (i in 1:nrow(dt))
    dt[i,PRIORAVG:=dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                      mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"]]
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA

Then use what you already had, with a slight simplification:

dt[,PREV:=dt[J(TICKER,PERIOD,ID,DATE-1),VALUE,roll=TRUE,mult="last"]]

     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

If this is OK as a prototype, then a large speed improvement would be to keep the loop but use set() instead of :=, to reduce overhead:

for (i in 1:nrow(dt))
    set(dt,i,6L,dt[J(TICKER[i],PERIOD[i],setdiff(ids,ID[i]),DATE[i]),
                   mean(VALUE,na.rm=TRUE),roll=TRUE,mult="last"])
dt
     TICKER     PERIOD       DATE ID VALUE PRIORAVG PREV
[1,]    ABC 2010-12-31 2010-01-05  1   1.5       NA   NA
[2,]    ABC 2010-12-31 2010-01-08  1   1.4     1.30  1.5
[3,]    ABC 2010-12-31 2010-01-10  1   1.4     1.45  1.4
[4,]    ABC 2010-12-31 2010-01-13  1   1.5     1.40  1.4
[5,]    ABC 2010-12-31 2010-01-07  2   1.3     1.50   NA
[6,]    ABC 2010-12-31 2010-01-11  2   1.2     1.50  1.3
[7,]    ABC 2010-12-31 2010-01-09  3   1.6     1.35   NA
[8,]    DEF 2011-12-31 2011-02-05  1   2.3       NA   NA

That should be a lot faster than the repeated vector scans shown in the question.
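For readers unfamiliar with set(), here is a minimal sketch of the same loop pattern on a toy table (the table x and its columns a and b are made up for illustration). It passes the target column by name, which set() accepts as an alternative to a column position such as 6L and which keeps working if the column order changes.

```r
library(data.table)

# Toy table with a placeholder numeric column to fill in.
x <- data.table(a = 1:5, b = NA_real_)

# set(x, i, j, value) assigns by reference to row i, column j,
# without going through the [.data.table machinery on each iteration.
for (i in seq_len(nrow(x)))
    set(x, i, "b", x$a[i] * 2)
```

seq_len(nrow(x)) is used instead of 1:nrow(x) so that the loop body is skipped for an empty table.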

Or, the operation could be vectorized. But that would be harder to write and read, given the nature of this task.

Btw, there isn't any data in the question that would test the 180 day requirement. If you add some and show the expected output again, then I'll add the calculation of age using the join inherited scope I mentioned in comments.
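Separately from the join-inherited-scope idea mentioned above, one way to see a staleness limit in a rolling join is to pass a finite number to roll instead of TRUE. The sketch below is an assumption about how that could look, not the approach this answer had in mind; the table obs and the result names are made up for illustration.

```r
library(data.table)

# Two observations for a single ID (illustrative only).
obs <- data.table(ID    = c(1, 1),
                  DATE  = as.Date(c("2010-01-05", "2010-04-01")),
                  VALUE = c(1.5, 1.7))
setkey(obs, ID, DATE)

# Dates at which to look up the prevailing VALUE for ID 1.
asof <- as.Date(c("2010-08-03", "2010-12-31"))

# roll=TRUE carries the last observation forward indefinitely.
unlimited <- obs[J(1, asof), VALUE, roll = TRUE]

# roll=180 carries it forward by at most 180 days; older matches become NA.
limited <- obs[J(1, asof), VALUE, roll = 180]
```

The 2010-04-01 observation is 124 days old on 2010-08-03 and 274 days old on 2010-12-31, so limited keeps the first lookup and returns NA for the second, while unlimited returns 1.7 for both.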

Another possible approach, using a later version of data.table:

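library(data.table) #data.table_1.12.6 as of Nov 20, 2019
DT <- copy(dt) # assumes dt is the freshly built table from the question (no PRIORAVG/PREV columns yet)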
cols <- copy(names(DT))
DT[, c("MIN_DATE", "MAX_DATE") := .(DATE - 180L, DATE)]

DT[, PRIORAVG := 
        .SD[.SD, on=.(TICKER, PERIOD, DATE>=MIN_DATE, DATE<=MAX_DATE),
            by=.EACHI, {
                subdat <- .SD[x.ID!=i.ID]
                pavg <- if (subdat[, .N > 0L])
                    mean(subdat[, last(VALUE), ID]$V1, na.rm=TRUE)
                else 
                    NA_real_
                c(setNames(mget(paste0("i.", cols)), cols), .(PRIORAVG=pavg))
            }]$PRIORAVG
]

DT[, PREV := shift(VALUE), .(TICKER, PERIOD, ID)]

Output:

    TICKER     PERIOD       DATE ID VALUE   MIN_DATE   MAX_DATE PRIORAVG PREV
 1:    ABC 2010-12-31 2010-01-05  1   1.5 2009-07-09 2010-01-05       NA   NA
 2:    ABC 2010-12-31 2010-01-08  1   1.4 2009-07-12 2010-01-08     1.30  1.5
 3:    ABC 2010-12-31 2010-01-10  1   1.4 2009-07-14 2010-01-10     1.45  1.4
 4:    ABC 2010-12-31 2010-01-13  1   1.5 2009-07-17 2010-01-13     1.40  1.4
 5:    ABC 2010-12-31 2010-04-01  1   1.7 2009-10-03 2010-04-01     1.40  1.5
 6:    ABC 2010-12-31 2010-01-07  2   1.3 2009-07-11 2010-01-07     1.50   NA
 7:    ABC 2010-12-31 2010-01-11  2   1.2 2009-07-15 2010-01-11     1.50  1.3
 8:    ABC 2010-12-31 2010-04-02  2   1.8 2009-10-04 2010-04-02     1.65  1.2
 9:    ABC 2010-12-31 2010-08-03  2   1.7 2010-02-04 2010-08-03     1.70  1.8
10:    ABC 2010-12-31 2010-01-09  3   1.6 2009-07-13 2010-01-09     1.35   NA
11:    DEF 2011-12-31 2011-02-05  1   2.3 2010-08-09 2011-02-05       NA   NA
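The PREV line in this approach relies on shift(), which lags a vector by one position within each by group. A minimal sketch on a toy table (the table x is made up for illustration), assuming the rows are already in date order within each ID, as they are after the setkey call in the question:

```r
library(data.table)

# Toy data, already ordered by time within each ID.
x <- data.table(ID    = c(1, 1, 1, 2, 2),
                VALUE = c(1.5, 1.4, 1.4, 1.3, 1.2))

# shift() defaults to a lag of 1 filled with NA; by= restarts it per ID.
x[, PREV := shift(VALUE), by = ID]
```

The first row of each ID gets NA because there is no earlier value to carry over. If the table were not sorted by date within ID, the lagged values would follow row order rather than time order.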
