R - Apply the diff() function or an equivalent self-defined function to multiple columns in a data.table

Question

I currently have a data.table that looks roughly like this:

ID   Date         Var1   Var2   Var3   Var4
1    2020-03-01   AB     A33    250    12
1    2020-04-01   B      B25    NA     14
1    2020-05-01   AB     A44    270    20
1    2020-06-01   AC     C33    9     13
2    2019-09-01   X      C55    280    11
2    2019-10-01   K      C89    120    12
2    2019-11-01   A      C89    320    NA
2    2019-12-01   AB     A88    200    25

This data table stores the key ID and some corresponding variables. Some are of type char and some of type numeric. The table is sorted with setkey(dt, ID, Date). I want to compute the lagged difference for each numeric variable within each ID.

In my data I have the names and indices of the numeric columns stored in vectors that look like this:

cols <- c("Var3", "Var4")
cols_indx <- c(5:6)

Then I want to add new columns with the lagged differences of the numeric variables Var3 and Var4 to my data.table dt.

I tried:

# Doesn't work    
as.data.frame(lapply(dt[ , cols, with = FALSE], diff, lag = 1))
as.data.frame(lapply(dt[ , cols_indx, with = FALSE], diff, lag = 1))
as.data.frame(lapply(dt[ , .SD, .SDcols = cols], diff, lag = 1))
as.data.frame(lapply(dt[ , .SD, .SDcols = cols_indx], diff, lag = 1))

On my data none of these works; each results in r[i1] - r[-length(r):-(length(r) - lag + 1L)]: non-numeric argument for binary operator. I can't figure out what is causing this, especially as I don't see a binary operator anywhere in this code.

However, once I explicitly state either the column names or the column indices, everything works fine. Why is that? In my case I need to shift a long data.table with > 250 columns and then compute the lagged differences of all those columns, for multiple lag intervals. Defining all selected columns by hand is not manageable. What am I missing here?

# Works    
as.data.frame(lapply(dt[ , 5:6], diff, lag = 1))
as.data.frame(lapply(dt[ , c("Var3", "Var4")], diff, lag = 1))

Additionally, one step is still missing. I want to compute the lagged differences within each group (defined by ID). When I try diff and a self-defined function, both throw similar errors.

i <- 1
lag_names_diff <- paste(cols, "Lag", i, "d", sep = "_")

dt[ , (lag_names_diff) := lapply(.SD, function(x) x - shift(x, (i), type = "lag")),
       .SDcols = cols, by = ID] 
# Error 1:
# x - shift(x, (i), type = "lag") : non-numeric argument for binary operator

# or

dt[ , (lag_names_diff) := lapply(.SD, diff, x = cols, lag = i, differences = 1),
      .SDcols = cols, by = ID]
# Error 2:
# r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument for binary operator

... everything breaks down with the error message. I cannot figure out what is causing this. I would highly appreciate any pointers.

Answer 1

The error seems to occur because diff(any_vector) returns a vector that is one element shorter than any_vector. See this:

diff(1:5)
[1] 1 1 1 1
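
To make the length mismatch concrete, here is a minimal base R sketch. It uses the Var4 values for ID 1 from the question; the variable names x, d and padded are only illustrative.

```r
x <- c(12, 14, 20, 13)   # Var4 values for ID 1 in the question's table
d <- diff(x)             # lagged differences with lag = 1

length(x)                # 4
length(d)                # 3, one element shorter than x

# Padding with NA at the start restores the original length,
# so the result can be stored as a column next to x.
padded <- c(NA, d)
length(padded) == length(x)
```

The same padding idea applies to any lag k: diff(x, lag = k) is k elements shorter than x, so k leading NA values would be needed.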

So if diff is to be applied to a variable in a table, one element has to be added to the result, either at the end or at the start. Although I am not sure of your expected outcome, I presume it is this. (I am adding NA at the start of the resulting vector. You may add 0 instead, if so desired.)

library(dplyr)
# all_of() selects the columns named in the character vector cols
df %>% mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))

  ID       Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
1  1 2020-03-01   AB  A33  250   12        NA        NA
2  1 2020-04-01    B  B25   NA   14        NA         2
3  1 2020-05-01   AB  A44  270   20        NA         6
4  1 2020-06-01   AC  C33    9   13      -261        -7
5  2 2019-09-01    X  C55  280   11       271        -2
6  2 2019-10-01    K  C89  120   12      -160         1
7  2 2019-11-01    A  C89  320   NA       200        NA
8  2 2019-12-01   AB  A88  200   25      -120        NA

Or, if grouping by ID is required:

df %>% group_by(ID) %>%
  mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))

# A tibble: 8 x 8
# Groups:   ID [2]
     ID Date       Var1  Var2   Var3  Var4 Var3_diff Var4_diff
  <int> <chr>      <chr> <chr> <int> <int>     <int>     <int>
1     1 2020-03-01 AB    A33     250    12        NA        NA
2     1 2020-04-01 B     B25      NA    14        NA         2
3     1 2020-05-01 AB    A44     270    20        NA         6
4     1 2020-06-01 AC    C33       9    13      -261        -7
5     2 2019-09-01 X     C55     280    11        NA        NA
6     2 2019-10-01 K     C89     120    12      -160         1
7     2 2019-11-01 A     C89     320    NA       200        NA
8     2 2019-12-01 AB    A88     200    25      -120        NA
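
Since the question is about data.table, the grouped result can also be sketched there. This is only a sketch under stated assumptions: dt is rebuilt by hand from the table in the question, and the numeric columns are detected with is.numeric rather than listed manually (that detection step is my assumption, not something the question does). If a selected column were stored as character, the subtraction would raise the non-numeric error, which may be what happens on the real data.

```r
library(data.table)

# Example data rebuilt from the table in the question
dt <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Date = as.Date(c("2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01",
                   "2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01")),
  Var1 = c("AB", "B", "AB", "AC", "X", "K", "A", "AB"),
  Var2 = c("A33", "B25", "A44", "C33", "C55", "C89", "C89", "A88"),
  Var3 = c(250, NA, 270, 9, 280, 120, 320, 200),
  Var4 = c(12, 14, 20, 13, 11, 12, NA, 25)
)
setkey(dt, ID, Date)

# Assumption: select numeric columns programmatically, excluding the key ID
cols <- setdiff(names(dt)[sapply(dt, is.numeric)], "ID")

i <- 1
lag_names_diff <- paste(cols, "Lag", i, "d", sep = "_")

# x - shift(x, i) has the same length as x within each group,
# so no manual NA padding is needed.
dt[, (lag_names_diff) := lapply(.SD, function(x) x - shift(x, i, type = "lag")),
   .SDcols = cols, by = ID]
```

On this example data the new columns Var3_Lag_1_d and Var4_Lag_1_d hold the same values as Var3_diff and Var4_diff in the grouped tibble above.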

Answer 1 was accepted on 2021-04-07 13:30:12.
