“recycling” error in user-defined function for data.table
I've joined two data tables and am calculating means based on a subset of that data. The code below runs properly when it's not within a function that I wrote, but I'm getting this error when I try to use the function:
Error in `[.data.table`(poll.name, AQ.Date >= Cdate & AQ.Date < Cdate + :
i evaluates to a logical vector length 159 but there are 2797432 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
My function:
myfunc <- function(linked.dat, poll.name) {
linked.dat[,
`:=` (t1.avg = mean(poll.name[AQ.Date >= Cdate & AQ.Date < Cdate + 1], na.rm = TRUE),
t2.avg = mean(poll.name[AQ.Date >= Cdate + 1 & AQ.Date < Cdate + 2], na.rm = TRUE),
t3.avg = mean(poll.name[AQ.Date >= Cdate + 2 & AQ.Date <= Bdate], na.rm = TRUE),
total.avg = mean(poll.name)),
by = ID]
linked.pollname <- linked.dat
return(linked.pollname)
}
So using this function with the example df would look like:
myfunc(df, O3)
Some data:
df <- structure(list(O3 = c(21.1, 27.3, 23.8, 29.5, 23.8, 27.1, 31.6,
25.8, 31.2, 14, 19.1, 15.5, 15.6, 28.6, 16.9, 27.4, 30.1, 24.4,
21.2, 22.1, 26.1, 19.9), AQ.Date = structure(c(3679, 3681, 3682,
3683, 3680, 3685, 3686, 3687, 3684, 3689, 3673, 3675, 3677, 3678,
3686, 3687, 3688, 3692, 3681, 3693, 3695, 3696), class = "Date"),
ID = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"
), Cdate = structure(c(3673, 3673, 3673, 3673, 3673,
3673, 3673, 3673, 3673, 3673, 3673, 3673, 3677, 3677, 3677,
3677, 3677, 3677, 3677, 3677, 3677, 3677), class = "Date"),
Bdate = structure(c(3690, 3690, 3690, 3690, 3690, 3690,
3690, 3690, 3690, 3690, 3690, 3690, 3696, 3696, 3696, 3696,
3696, 3696, 3696, 3696, 3696, 3696), class = "Date"), Total_weeks = c(2.428571,
2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.428571,
2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.714286,
2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 2.714286,
2.714286, 2.714286, 2.714286)), row.names = c(NA, -22L), class = "data.frame")
setDT(df)
I'm not understanding what this error means. What is the recycling referring to? Why is it only happening within the function? How can I adjust the function to address the error?
Recycling has to do with how vectors of different lengths are combined into a data.frame (and some other places). Every column of a data.frame (and therefore a data.table and tbl_df) must be the same length, and when something is not the same length it is recycled.
In most (all?) base R functions, recycling is done silently as long as the length of the longest vector is a whole multiple of the lengths of the shorter vectors. For instance,
data.frame(x = 1, y = 1:3)
# x y
# 1 1 1
# 2 1 2
# 3 1 3
data.frame(x = 1:2, y = 1:4)
# x y
# 1 1 1
# 2 2 2
# 3 1 3
# 4 2 4
but R will error (usually, but not in all cases) when the lengths are not whole multiples of each other:
data.frame(x = 1:3, y = 1:4)
# Error in data.frame(x = 1:3, y = 1:4) :
# arguments imply differing number of rows: 3, 4
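As a small illustration of the "not in all cases" caveat, here is a sketch using plain vector arithmetic, where base R recycles silently for whole multiples and only warns (does not error) otherwise. The variable names are made up for this example.

```r
# Lengths 6 and 2: the shorter vector is recycled three times, silently.
ok <- 1:6 + 1:2
ok

# Lengths 3 and 4: base R still recycles, but raises a warning (not an error).
# tryCatch() is used here only to capture the warning text for inspection.
warn <- tryCatch(1:3 + 1:4, warning = function(w) conditionMessage(w))
warn
```

So data.frame() is stricter than arithmetic: the same length mismatch that stops data.frame() only produces a warning when adding vectors.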
My personal opinion is that recycling is a balance between convenience and safety, where "convenience" is that I want to add a column with a single invariant value to a frame with multiple rows, as in the first example above; "safety" is that you are certain what each function is returning (e.g., its length) and surprises are not hidden.
For the latter, consider a custom function (meant to mimic which.min) that finds the location of the minimum value:
myfunc <- function(x) which(x == min(x)) # this is naive, do not use it
With "normal" data, it will return a single value, as in
set.seed(42)
myfunc(runif(10))
# [1] 8
However, when dealing with integers or anything else where ties can happen (and in some rare numeric instances), one might get more than one:
myfunc(sample(10, size = 11, replace = TRUE))
# [1] 2 10
Because of this, if you rely on it returning a single value but it instead returns two or more, then something you rely on might do silent recycling and you are none the wiser. For instance,
set.seed(3)
mydat <- data.frame(x = sample(10, size = 12, replace = TRUE))
mydat$y <- myfunc(mydat$x)
mydat
# x y
# 1 5 4
# 2 10 8
# 3 7 4
# 4 4 8
# 5 10 4
# 6 8 8
# 7 8 4
# 8 4 8
# 9 10 4
# 10 7 8
# 11 8 4
# 12 8 8
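One way to keep such a surprise from being recycled silently is to check the length before using the result. The following is only a sketch: naive_which_min repeats the naive function above under a different name, and strict_which_min is a hypothetical wrapper invented for this example.

```r
# Same naive implementation as above, renamed to avoid clashing with myfunc.
naive_which_min <- function(x) which(x == min(x))

x <- c(4L, 1L, 3L, 1L)
naive_which_min(x)  # two positions, because the minimum is tied
which.min(x)        # base R returns only the first position

# Hypothetical wrapper: fail loudly instead of letting a longer result
# be recycled into a column later on.
strict_which_min <- function(x) {
  idx <- which(x == min(x))
  if (length(idx) != 1L) {
    stop("expected exactly one minimum, found ", length(idx))
  }
  idx
}

# tryCatch() captures the error message so the script keeps running.
res <- tryCatch(strict_which_min(x), error = function(e) conditionMessage(e))
res
```

Whether to stop, warn, or fall back to which.min() is a design choice; the point is that the length assumption is checked before anything can recycle the result.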
From my perspective, recycling is only "acceptable" when it's an all-or-1 thing; anything else can be used correctly in many places, but in my opinion should really be explicit.
tibble allows all-or-1, otherwise it errors:
library(tibble)
tibble(x = 1, y = 1:3)
# # A tibble: 3 x 2
# x y
# <dbl> <int>
# 1 1 1
# 2 1 2
# 3 1 3
tibble(x = 1:2, y = 1:3)
# Error: Tibble columns must have compatible sizes.
# * Size 2: Existing data.
# * Size 3: Column `y`.
# i Only values of size one are recycled.
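When recycling beyond all-or-1 really is intended, it can be spelled out with rep(), which is also what the data.table error message in the question recommends. A minimal sketch in base R, with made-up names:

```r
n <- 5L

# Explicitly stretch a length-2 pattern to exactly n elements, so the
# intent is visible in the code rather than left to implicit recycling.
flag <- rep(c(TRUE, FALSE), length.out = n)
flag

# All columns now have the same length, so no recycling rule is involved.
dat <- data.frame(id = seq_len(n), keep = flag)
dat
```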
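You are trying to do non-standard evaluation of the symbol O3 outside of the data.table construct. Inside the function, poll.name is not looked up as a column of linked.dat; judging by the error message, it resolved to an object with 2797432 rows, while the logical condition built from the grouped columns had only 159 elements, so data.table refused to recycle it. I believe you are intending to take the mean of a user-provided column of the frame based on other conditions.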
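The error message itself comes from data.table refusing to recycle a logical `i` whose length does not match the number of rows. It can be reproduced on a small scale; the table below is made up for illustration and assumes the data.table package is installed.

```r
library(data.table)

dt <- data.table(x = 1:6)

# A logical i of length 2 against 6 rows: data.table stops rather than
# recycling. tryCatch() is used only to capture the message.
msg <- tryCatch(dt[c(TRUE, FALSE)], error = function(e) conditionMessage(e))
msg

# Recycling made explicit, as the error message suggests.
every_other <- dt[rep(c(TRUE, FALSE), length.out = .N)]
every_other
```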
Here's one way to get around it: pass a string, and use get(poll.name) (wherever you need the data) within the data.table to get at the data:
myfunc <- function(linked.dat, poll.name) {
linked.dat[,
`:=` (t1.avg = mean(get(poll.name)[AQ.Date >= Cdate & AQ.Date < Cdate + 1], na.rm = TRUE),
t2.avg = mean(get(poll.name)[AQ.Date >= Cdate + 1 & AQ.Date < Cdate + 2], na.rm = TRUE),
t3.avg = mean(get(poll.name)[AQ.Date >= Cdate + 2 & AQ.Date <= Bdate], na.rm = TRUE),
total.avg = mean(get(poll.name))),
by = ID]
linked.pollname <- linked.dat
return(linked.pollname)
}
myfunc(df, "O3")
# O3 AQ.Date ID Cdate Bdate Total_weeks t1.avg t2.avg t3.avg total.avg
# 1: 21.1 1980-01-28 a 1980-01-22 1980-02-08 2.428571 19.1 NaN 24.60909 24.15
# 2: 27.3 1980-01-30 a 1980-01-22 1980-02-08 2.428571 19.1 NaN 24.60909 24.15
# 3: 23.8 1980-01-31 a 1980-01-22 1980-02-08 2.428571 19.1 NaN 24.60909 24.15
# 4: 29.5 1980-02-01 a 1980-01-22 1980-02-08 2.428571 19.1 NaN 24.60909 24.15
# 5: 23.8 1980-01-29 a 1980-01-22 1980-02-08 2.428571 19.1 NaN 24.60909 24.15
# ---
# 18: 24.4 1980-02-10 b 1980-01-26 1980-02-14 2.714286 15.6 28.6 23.51250 23.23
# 19: 21.2 1980-01-30 b 1980-01-26 1980-02-14 2.714286 15.6 28.6 23.51250 23.23
# 20: 22.1 1980-02-11 b 1980-01-26 1980-02-14 2.714286 15.6 28.6 23.51250 23.23
# 21: 26.1 1980-02-13 b 1980-01-26 1980-02-14 2.714286 15.6 28.6 23.51250 23.23
# 22: 19.9 1980-02-14 b 1980-01-26 1980-02-14 2.714286 15.6 28.6 23.51250 23.23
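If you would rather keep the unquoted call style from the question, myfunc(df, O3), one possible variation is to turn the symbol into a string with deparse(substitute(...)) and then use get() as above. This is only a sketch: myfunc_nse, poll.col, and the toy table are hypothetical names for this example, and only the overall mean is computed to keep it short.

```r
library(data.table)

myfunc_nse <- function(linked.dat, poll.name) {
  # Capture the unevaluated argument and convert it to a column name,
  # e.g. "O3" when called as myfunc_nse(toy, O3).
  poll.col <- deparse(substitute(poll.name))
  stopifnot(poll.col %in% names(linked.dat))

  # get(poll.col) is evaluated inside the data.table, so it finds the column
  # within each ID group.
  linked.dat[, total.avg := mean(get(poll.col), na.rm = TRUE), by = ID]
  linked.dat[]
}

toy <- data.table(ID = c("a", "a", "b"), O3 = c(1, 3, 10))
res <- myfunc_nse(toy, O3)
res
```

Note that, like the original function, this modifies the table by reference because of `:=`.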