
data.table: Subset and find cumulative product for each row

I have a simple data frame containing three columns: an id, a date, and a value. Now I want to calculate a new value, newValue, from these three columns using the following procedure:

  1. For each row (i.e., for each (id, date) pair),
  2. take all rows of that id whose date lies in the range [date, date + 2] (both ends inclusive), compute the product of (value + 1) over those rows, and then subtract 1.

The simple example below, with made-up numbers, does the computation:

df <- data.frame("id"=rep(1:10, 5),
                 "date"=c(rep(2000, 10), rep(2001, 10), rep(2002, 10), rep(2003, 10), rep(2004, 10)),
                 "value"=c(rep(1, 10), rep(2, 10), rep(3, 10), rep(4, 10), rep(5, 10)))

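df$newValue <- 1 # initialize

for (idx in seq_len(nrow(df))) {
  id <- df[idx, "id"]
  lower <- df[idx, "date"]
  upper <- lower + 3 # exclusive upper bound, i.e. dates lower, lower + 1, lower + 2

  # product of (value + 1) over this id's rows inside the window, minus 1
  in_window <- (df$id == id) & (df$date >= lower) & (df$date < upper)
  df[idx, "newValue"] <- prod(df$value[in_window] + 1) - 1
}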

This gives me the following output (annotated for clarity):

   id date value newValue
1   1 2000     1       23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
2   2 2000     1       23 (= (1+1) * (2+1) * (3+1) - 1 = 23)
....
12  2 2001     2       59 (= (2+1) * (3+1) * (4+1) - 1 = 59)
....
22  2 2002     3      119 (= (3+1) * (4+1) * (5+1) - 1 = 119)
....

However, my final data frame has more than 1 million rows, so the code above is very slow and inefficient.

Is there a way to speed it up, perhaps using a data.table? Note that each id may have a different number of rows, which is why I subset explicitly.

library(data.table)
library(purrr)

setDT(df)[, newValue := map_dbl(date, ~prod(value[between(date, .x, .x + 2)] + 1) - 1), by = id]

gives (only showing id = 1):

     id date value newValue
 1:  1 2000     1       23
 2:  1 2001     2       59
 3:  1 2002     3      119
 4:  1 2003     4       29
 5:  1 2004     5        5
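
As an alternative sketch that stays entirely within data.table (this is not the approach used in the answer above), the same window can be expressed as a non-equi self-join. The toy table dt, the helper column upper, and the intermediate result win are names assumed here for illustration only.

```r
library(data.table)

# Toy data (assumed for illustration): 2 ids, yearly dates, values 1..5
dt <- data.table(
  id    = rep(1:2, each = 5),
  date  = rep(2000:2004, times = 2),
  value = rep(1:5, times = 2)
)

# Inclusive upper end of each row's window
dt[, upper := date + 2L]

# For every row of i (the right-hand dt), match the rows of x (the left-hand dt)
# with the same id and x.date in [i.date, i.upper], then aggregate per i row
win <- dt[dt,
          on = .(id, date >= date, date <= upper),
          .(newValue = prod(x.value + 1) - 1),
          by = .EACHI]

# by = .EACHI returns one row per row of i, in the order of i
dt[, newValue := win$newValue][, upper := NULL]
print(dt[id == 1])
```

Whether this is faster than the map_dbl version depends on the data, so it is worth timing both on a realistic sample before choosing one.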

Update:

Because each date occurs at most once within each id, this should be more efficient:

df <- setDT(df)[order(id, date)]

df[, 
  newValue := map2_dbl(
    date, map(seq_len(.N), ~.x:min(.x+2, .N)), 
    ~prod(value[.y][between(date[.y], .x, .x + 2)] + 1) - 1
  ), 
  by = id
]

If you want a window other than 2, you can create a variable date_range and replace both occurrences of 2 with date_range.
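
A minimal sketch of that parameterization, written with base R's vapply instead of purrr so that it only needs data.table. The helper name window_product and the toy data are assumptions made for this example; they do not appear in the answer.

```r
library(data.table)

# Hypothetical helper: for each date d, the product of (value + 1) over all
# rows whose date lies in [d, d + date_range], minus 1
window_product <- function(date, value, date_range = 2) {
  vapply(
    date,
    function(d) prod(value[date >= d & date <= d + date_range] + 1) - 1,
    numeric(1)
  )
}

# Toy data (assumed for illustration)
df <- data.table(
  id    = rep(1:2, each = 5),
  date  = rep(2000:2004, times = 2),
  value = rep(1:5, times = 2)
)

# Same window as in the answer (date_range = 2), plus a narrower one
df[, newValue := window_product(date, value, date_range = 2), by = id]
df[, newValue1 := window_product(date, value, date_range = 1), by = id]
print(df[id == 1])
```

Because the helper filters on the dates themselves rather than on row positions, it does not require the rows to be sorted or the dates to be consecutive.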

I think this tidyverse solution can also do the job.

To address the missing year/date problem, I deleted two rows from id == 2. Sample data used:

> dput(df)
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 
6L, 6L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 
9L, 10L, 10L, 10L, 10L, 10L), date = c(2000, 2001, 2002, 2003, 
2004, 2000, 2001, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 
2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 
2003, 2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 
2004, 2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004
), value = c(1, 2, 3, 4, 5, 1, 2, 5, 1, 2, 3, 4, 5, 1, 2, 3, 
4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)), class = "data.frame", row.names = c(NA, 
-48L))

df

# A tibble: 48 x 3
      id  date value
   <int> <dbl> <dbl>
 1     1  2000     1
 2     1  2001     2
 3     1  2002     3
 4     1  2003     4
 5     1  2004     5
 6     2  2000     1
 7     2  2001     2
 8     2  2004     5
 9     3  2000     1
10     3  2001     2
# ... with 38 more rows
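
For reference, here is a small base R sketch of what the question's row-by-row definition yields for id == 2 once the years 2002 and 2003 are missing. The data frame gap simply copies the three id == 2 rows from the sample data; the name is assumed for illustration.

```r
# The three remaining rows for id == 2 (copied from the sample data)
gap <- data.frame(id = 2L, date = c(2000, 2001, 2004), value = c(1, 2, 5))

# Apply the question's definition directly: for each row, multiply (value + 1)
# over the rows of the same id whose date lies in [date, date + 2], minus 1
gap$newValue <- vapply(seq_len(nrow(gap)), function(i) {
  in_window <- gap$id == gap$id[i] &
    gap$date >= gap$date[i] &
    gap$date <= gap$date[i] + 2
  prod(gap$value[in_window] + 1) - 1
}, numeric(1))

print(gap)
```

These values (5, 2, 5) are what any faster solution should reproduce for the rows that actually exist in the gapped data.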

Now the tidyverse solution:

library(tidyverse)

df %>% arrange(id, date) %>%
  group_by(id) %>%
  complete(date = min(date):max(date), fill = list(value = 0)) %>%
  mutate(new_val = (value +1)*(lead(value, default = 0)+1)*(lead(value, n=2, default = 0)+1)-1) %>%
  ungroup()

# A tibble: 50 x 4
      id  date value new_val
   <int> <dbl> <dbl>   <dbl>
 1     1  2000     1      23
 2     1  2001     2      59
 3     1  2002     3     119
 4     1  2003     4      29
 5     1  2004     5       5
 6     2  2000     1       5
 7     2  2001     2       2
 8     2  2002     0       5
 9     2  2003     0       5
10     2  2004     5       5
# ... with 40 more rows

EDIT: Moreover, if the extra years have to be removed:

df %>% arrange(id, date) %>%
  group_by(id) %>%
  complete(date = min(date):max(date), fill = list(value = 0)) %>%
  mutate(new_val = (value +1)*(lead(value, default = 0)+1)*(lead(value, n=2, default = 0)+1)-1) %>%
  ungroup() %>% right_join(df, by = c("id", "date", "value"))


 