How do I create a moving average based on weekly dates, grouped by multiple columns in data.table?

I am reading in an extremely large dataset as a data.table for speed. The relevant columns are DATE (weekly data as year-month-day strings, e.g. "2017-12-25"), V1 (integer), V2 (string), and V3 (numeric). I would like to produce V4, the moving average of V3 over the last 3 weeks (DATE, DATE-7, and DATE-14). Here is a naive attempt, which is terribly inefficient:

library(data.table)

dt <- fread("largefile.csv")
dt[, DATE := as.IDate(DATE)]  # convert date strings to IDate

V1_list <- sort(unique(dt$V1))
V2_list <- sort(unique(dt$V2))
DATE_list <- sort(unique(dt$DATE))

# naive triple loop: one subset and one mean per (V1, V2, DATE) combination
for (i in seq_along(V1_list)) {
  for (j in seq_along(V2_list)) {
    for (k in 3:length(DATE_list)) {
      # vectorised `&` (not `&&`), and (k - 2):k rather than k-2:k
      dt[V1 == V1_list[i] & V2 == V2_list[j] & DATE == DATE_list[k],
         V4 := mean(dt[V1 == V1_list[i] & V2 == V2_list[j] &
                         DATE %in% DATE_list[(k - 2):k], V3])]
    }
  }
}

I am avoiding plyr, partly because of computational constraints given the 50M rows I'm working with. I have investigated options with setkey() and zoo / rolling functions, but I cannot figure out how to layer in the date component (assuming I group by V1, V2 and average V3). Apologies for not providing sample code.

The OP has requested to append a new column which is the rolling average of V3 over the past 3 weeks, grouped by V1 and V2, for a data.table of 50 M rows.

If the DATE values are without gaps, i.e., no group has missing weeks, one possible approach is to use the rollmeanr() function from the zoo package:

DT[order(DATE), V4 := zoo::rollmeanr(V3, 3L, fill = NA), by = .(V1, V2)]
DT[order(V1, V2, DATE)]
          DATE V1 V2 V3 V4
 1: 2017-12-04  1  A  1 NA
 2: 2017-12-11  1  A  2 NA
 3: 2017-12-18  1  A  3  2
 4: 2017-12-25  1  A  4  3
 5: 2017-12-04  1  B  5 NA
 6: 2017-12-11  1  B  6 NA
 7: 2017-12-18  1  B  7  6
 8: 2017-12-25  1  B  8  7
 9: 2017-12-04  2  A  9 NA
10: 2017-12-11  2  A 10 NA
11: 2017-12-18  2  A 11 10
12: 2017-12-25  2  A 12 11
13: 2017-12-04  2  B 13 NA
14: 2017-12-11  2  B 14 NA
15: 2017-12-18  2  B 15 14
16: 2017-12-25  2  B 16 15

Note that the NAs are introduced on purpose, because the first two rows within each group have no DATE-7 and DATE-14 values.

Also note that this approach does not require type conversion of the character dates, because dates in %Y-%m-%d format sort chronologically even as character strings.
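As an alternative to zoo, data.table itself ships a rolling mean, frollmean(). The following is only a sketch under the same no-gap assumption, not part of the original answer; it rebuilds the toy data from the Data section and relies on frollmean()'s defaults (right-aligned window, NA fill).

library(data.table)

# Same toy data as in the Data section: 4 weeks x 2 values of V1 x 2 values of V2
DT <- CJ(
  DATE = as.character(seq(as.Date("2017-12-04"), by = "1 week", length.out = 4L)),
  V1 = 1:2,
  V2 = c("A", "B")
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))]

# Sort once by group and date, then take a right-aligned 3-row rolling mean per group
setorder(DT, V1, V2, DATE)
DT[, V4 := frollmean(V3, 3L), by = .(V1, V2)]

Like rollmeanr(), this averages the current row and the two preceding rows of each group, so it is only a 3-week average when no weeks are missing.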

Data

According to the OP's description, the data.table has 4 columns: DATE holds weekly character dates in the standard unambiguous format %Y-%m-%d, V1 is of type integer, V2 is of type character, and V3 is of type double (numeric). V1 and V2 are used for grouping.

library(data.table)
# create data
n_week = 4L
n_V1 = 2L
# cross join
DT <- CJ(
  DATE = as.character(rev(seq(as.Date("2017-12-25"), length.out = n_week, by = "-1 week"))),
  V1 = seq_len(n_V1),
  V2 = LETTERS[1:2]
)
DT[order(V1, V2, DATE), V3 := as.numeric(seq_len(.N))][]
          DATE V1 V2 V3
 1: 2017-12-04  1  A  1
 2: 2017-12-04  1  B  5
 3: 2017-12-04  2  A  9
 4: 2017-12-04  2  B 13
 5: 2017-12-11  1  A  2
 6: 2017-12-11  1  B  6
 7: 2017-12-11  2  A 10
 8: 2017-12-11  2  B 14
 9: 2017-12-18  1  A  3
10: 2017-12-18  1  B  7
11: 2017-12-18  2  A 11
12: 2017-12-18  2  B 15
13: 2017-12-25  1  A  4
14: 2017-12-25  1  B  8
15: 2017-12-25  2  A 12
16: 2017-12-25  2  B 16
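The rollmeanr() approach relies on every group having no missing weeks. A minimal sketch of how that assumption could be checked before relying on it; the small table below is hypothetical and deliberately drops one week from group (1, "B").

library(data.table)

# Hypothetical toy data: group (1, "A") is complete, group (1, "B") lacks 2017-12-18
DT <- data.table(
  DATE = c("2017-12-04", "2017-12-11", "2017-12-18", "2017-12-25",
           "2017-12-04", "2017-12-11", "2017-12-25"),
  V1 = 1L,
  V2 = rep(c("A", "B"), times = c(4L, 3L)),
  V3 = as.numeric(1:7)
)

# For each group, check that consecutive dates are exactly 7 days apart
gap_check <- DT[order(DATE),
                .(no_gap = all(diff(as.integer(as.IDate(DATE))) == 7L)),
                by = .(V1, V2)]
gap_check

Any group with no_gap equal to FALSE would get a rolling mean over the wrong weeks from a purely row-based window.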

So I tried to solve your problem using two inner_join() calls from the dplyr package:

First I created an example data.frame (1,000,000 rows):

library(dplyr)  # provides inner_join(), mutate(), select() and %>%

V3 <- seq(from = 1, to = 1000000, by = 1)
DATE <- seq(from = 1, to = 7000000, by = 7)
dt <- data.frame(V3, DATE)

Does it look correct? I dropped all unnecessary content and ignored the Date format (you can subtract from Dates the same way as from integers).

Next, I did two inner joins on the DATE column, where the right-hand data.frame has its DATE shifted by +7 in the first join and by +14 in the second, so that each row is matched with the values from one and two weeks earlier. Finally, I selected the 3 columns of interest and computed the row means. It took about 5 seconds on my lousy MacBook.

inner_join(
    inner_join(x= dt, y=mutate(dt, DATE=DATE+7), by= 'DATE'),
    y = mutate(dt, DATE= DATE+14), by= 'DATE')  %>% 
    select(V3 , V3.y, V3.x) %>% 
    rowMeans()

If you want to add it to your dt, remember that there is no average for the first 2 dates, because no DATE-14 and DATE-7 exist for them.

dt$V4 <-   c(NA, NA, inner_join(
        inner_join(x= dt, y=mutate(dt, DATE=DATE+7), by= 'DATE'),
        y = mutate(dt, DATE= DATE+14), by= 'DATE')  %>% 
        select(V3 , V3.y, V3.x) %>% 
        rowMeans())
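The joins above ignore the V1/V2 grouping. As a rough sketch only (not the answerer's code), the same shifted self-join idea can be written in data.table with the grouping columns added to the join. It assumes each (V1, V2, DATE) combination occurs at most once; the tiny table is hypothetical.

library(data.table)

# Hypothetical single-group example with IDate dates
DT <- data.table(
  DATE = as.IDate(c("2017-12-04", "2017-12-11", "2017-12-18", "2017-12-25")),
  V1 = 1L,
  V2 = "A",
  V3 = c(1, 2, 3, 4)
)

# Look up V3 from one and two weeks earlier within the same (V1, V2) group;
# rows without a match come back as NA
lag1 <- DT[DT[, .(V1, V2, DATE = DATE - 7L)],  V3, on = .(V1, V2, DATE)]
lag2 <- DT[DT[, .(V1, V2, DATE = DATE - 14L)], V3, on = .(V1, V2, DATE)]

# Average of the current week and the two preceding weeks
DT[, V4 := (V3 + lag1 + lag2) / 3]

Because the lookup is by date rather than by row position, a missing week yields NA instead of silently averaging the wrong rows.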
