R - apply diff() function or equivalent self-defined function on multiple columns in a data.table
I currently have a data.table that looks roughly like this:
ID Date Var1 Var2 Var3 Var4
1 2020-03-01 AB A33 250 12
1 2020-04-01 B B25 NA 14
1 2020-05-01 AB A44 270 20
1 2020-06-01 AC C33 9 13
2 2019-09-01 X C55 280 11
2 2019-10-01 K C89 120 12
2 2019-11-01 A C89 320 NA
2 2019-12-01 AB A88 200 25
This data table stores the key ID and some corresponding variables. Some are of type character and some of type numeric. The table is sorted with setkey(dt, ID, Date). I want to compute the lagged difference for each numeric variable within each ID.
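For reproducibility, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the data.table package is installed and that Date is stored as a Date column, and the real data has many more columns.

```r
library(data.table)

# Rebuild the example table shown above (values copied from the post)
dt <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Date = as.Date(c("2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01",
                   "2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01")),
  Var1 = c("AB", "B", "AB", "AC", "X", "K", "A", "AB"),
  Var2 = c("A33", "B25", "A44", "C33", "C55", "C89", "C89", "A88"),
  Var3 = c(250, NA, 270, 9, 280, 120, 320, 200),
  Var4 = c(12, 14, 20, 13, 11, 12, NA, 25)
)
setkey(dt, ID, Date)  # sorts by ID, then by Date

# Show the class of every column
sapply(dt, function(col) class(col)[1])
```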
In my data I have the numeric columns extracted in vectors that look like this:
cols <- c("Var3", "Var4")
cols_indx <- c(5:6)
Then I want to add new columns with the lagged difference of the numeric variables Var5 and Var6 to my data.table dt.
I tried:
# Doesn't work
as.data.frame(lapply(dt[ , cols, with = FALSE], diff, lag = 1))
as.data.frame(lapply(dt[ , cols_indx, with = FALSE], diff, lag = 1))
as.data.frame(lapply(dt[ , .SD, .SDcols = cols], diff, lag = 1))
as.data.frame(lapply(dt[ , .SD, .SDcols = cols_indx], diff, lag = 1))
On my data none of these works; each results in r[i1] - r[-length(r):-(length(r) - lag + 1L)]: non-numeric argument for binary operator. I can't seem to figure out what is causing this, especially as I don't see a binary operator anywhere in this code.
However, once I explicitly state either the column names or the column indices, everything works fine. Why is that? In my case I need to shift a long data.table with > 250 columns and then compute the lagged differences of all those columns, and all of that for multiple lag intervals. It is not manageable to define all selected columns by hand. What am I missing here?
# Works
as.data.frame(lapply(dt[ , 5:6], diff, lag = 1))
as.data.frame(lapply(financials.dt[ , c("Var4", "Var5")], diff, lag = 1))
Additionally, one step is still missing. I want to compute the lagged differences within each group (defined by ID). When I try diff and a self-defined function, both throw similar errors.
i <- 1
lag_names_diff <- paste(cols, "Lag", i, "d", sep = "_")
dt[ , (lag_names_diff) := lapply(.SD, function(x) x - shift(x, (i), type = "lag")),
.SDcols = cols, by = ID]
# Error 1:
# r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument for binary operator
# or
dt[ , (lag_names_diff) := lapply(.SD, diff, x = cols, lag = i, differences = 1),
.SDcols = cols, by = ID]
# Error 2:
# x - shift(x, (i), type = "lag") : non-numeric argument for binary operator
... everything breaks down with the error message. I cannot seem to figure out what is causing this. I would highly appreciate any pointers.
The error seems to arise because diff(any_vector) returns a vector that is one element shorter than any_vector. See this:
diff(1:5)
[1] 1 1 1 1
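The same holds for longer lags: diff(x, lag = k) returns length(x) - k elements. A small base R sketch of this, using made-up values for x:

```r
x <- c(250, 270, 9, 280, 120)   # made-up values for illustration

for (k in 1:3) {
  d <- diff(x, lag = k)
  # diff() drops k elements, so k leading NAs restore the original length
  padded <- c(rep(NA, k), d)
  cat("lag", k, ": diff length", length(d),
      ", padded length", length(padded), "\n")
}
```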
So if diff is to be applied to any variable in a table, one element has to be added to the result, either at the end or at the start. Although I am not sure of your expected outcome, I presume this is what you want. (I am adding NA to the start of the resulting vector. You may add 0 instead, if so desired.)
library(dplyr)
df %>% mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))
ID Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
1 1 2020-03-01 AB A33 250 12 NA NA
2 1 2020-04-01 B B25 NA 14 NA 2
3 1 2020-05-01 AB A44 270 20 NA 6
4 1 2020-06-01 AC C33 9 13 -261 -7
5 2 2019-09-01 X C55 280 11 271 -2
6 2 2019-10-01 K C89 120 12 -160 1
7 2 2019-11-01 A C89 320 NA 200 NA
8 2 2019-12-01 AB A88 200 25 -120 NA
Or, if grouping on ID is required:
df %>% group_by(ID) %>%
mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))
# A tibble: 8 x 8
# Groups: ID [2]
ID Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
<int> <chr> <chr> <chr> <int> <int> <int> <int>
1 1 2020-03-01 AB A33 250 12 NA NA
2 1 2020-04-01 B B25 NA 14 NA 2
3 1 2020-05-01 AB A44 270 20 NA 6
4 1 2020-06-01 AC C33 9 13 -261 -7
5 2 2019-09-01 X C55 280 11 NA NA
6 2 2019-10-01 K C89 120 12 -160 1
7 2 2019-11-01 A C89 320 NA 200 NA
8 2 2019-12-01 AB A88 200 25 -120 NA
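The grouped result can also be sketched directly in data.table, which is what the question uses. The following is a minimal sketch under a few assumptions: the table is a small stand-in rebuilt from the values in the post, the numeric columns are detected with is.numeric rather than listed by hand, and the loop variable k and the two lag intervals are chosen only for illustration.

```r
library(data.table)

# Small stand-in for the table in the question (values copied from the post)
dt <- data.table(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Date = as.Date(c("2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01",
                   "2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01")),
  Var1 = c("AB", "B", "AB", "AC", "X", "K", "A", "AB"),
  Var3 = c(250, NA, 270, 9, 280, 120, 320, 200),
  Var4 = c(12, 14, 20, 13, 11, 12, NA, 25)
)
setkey(dt, ID, Date)

# Pick the numeric columns programmatically, excluding the grouping key
cols <- setdiff(names(dt)[sapply(dt, is.numeric)], "ID")

# Add one set of lagged-difference columns per lag interval
for (k in 1:2) {
  lag_names_diff <- paste(cols, "Lag", k, "d", sep = "_")
  dt[, (lag_names_diff) := lapply(.SD, function(x) x - shift(x, n = k, type = "lag")),
     by = ID, .SDcols = cols]
}

dt
```

Because shift() returns a vector of the same length as its input, no manual padding is needed here. If any name in cols refers to a character column, the subtraction fails with a non-numeric argument error, so checking sapply(dt, class) on the real data is a reasonable first step.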