Ifelse behavior within data.table (R)
I have a data.table full of consumer products. I've classified each product as 'low', 'high', or 'unknown' quality. The data are time series, and I'm interested in smoothing out some seasonality in the data. If a product's raw classification (the classification churned out by the algorithm I used to determine quality) is 'low' quality in period X, but its raw classification was 'high' quality in period X-1, I'm reclassifying that product as 'high' quality for period X. This process is done within some sort of product group distinction.
To accomplish this, I've got something like the following:
require(data.table)

# lag takes a column and lags it by one period,
# padding with NA
lag <- function(var) {
  lagged <- c(NA,
              var[1:(length(var)-1)])
  return(lagged)
}

set.seed(120)
foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                  period = c(1:16),
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]
foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]
Taking a look at foo:
    group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA        NA
 3:     C      3    high          NA      high
 4:     D      4     low          NA        NA
 5:     B      5 unknown         low   unknown
 6:     C      6    high        high      high
 7:     D      7     low         low       low
 8:     B      8 unknown     unknown   unknown
 9:     C      9    high        high      high
10:     D     10 unknown         low   unknown
11:     B     11 unknown     unknown   unknown
12:     C     12     low        high      high
13:     D     13 unknown     unknown   unknown
14:     B     14    high     unknown      high
15:     C     15    high         low      high
16:     D     16 unknown     unknown   unknown
So, quality_1 is mostly what I want. If period X is 'low' and period X-1 is 'high', we see the reclassification to 'high' occur, and everything else is carried over intact from quality. However, when quality_lag is NA, 'low' gets reclassified to NA in quality_1. This is not an issue with 'high' or 'unknown'.
That is, the first four rows of foo should look like this:
   group period quality quality_lag quality_1
1:     A      1 unknown          NA   unknown
2:     B      2     low          NA       low
3:     C      3    high          NA      high
4:     D      4     low          NA       low
Any thoughts on what is causing this?
For starters, the development version on GitHub already has an efficient lag function called shift, which can be used as either a lag or a lead (and has some additional functionality too; see ?shift).
Also take a look here, as there are a bunch of other new functions now present in v >= 1.9.5.
So under v >= 1.9.5 we could simply do
foo[, quality_lag := shift(quality), by = group]
Though even under v < 1.9.5 you could make use of .N instead of writing your own lag function, in the following manner
foo[, quality_lag2 := c(NA, quality[-.N]), by = group]
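As a quick check that the two lag idioms above agree, here is a minimal sketch on a small made-up table. The table dt and its values are hypothetical and are not the asker's data; the sketch assumes a data.table version that ships shift (v >= 1.9.5, per the above).

```r
library(data.table)

# Hypothetical toy data: three groups, rows already in period order
dt <- data.table(group   = c("A", "B", "B", "C", "C", "C"),
                 quality = c("unknown", "low", "high", "high", "low", "low"))

# Lag with shift(): the default is a lag of one period, padded with NA
dt[, lag_shift := shift(quality), by = group]

# Same lag without shift(): drop the last element (.N) and prepend NA
dt[, lag_manual := c(NA, quality[-.N]), by = group]

# Compare the two columns, including the NA padding
identical(dt$lag_shift, dt$lag_manual)
```

Note that both idioms also handle a group with a single row (group A here), where the lag is just NA.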
Regarding your second question, I would recommend avoiding ifelse altogether, for the many reasons specified here.
One possible alternative is to just use simple indexing, as in
foo[, quality_1 := quality][quality == 'low' & quality_lag == 'high', quality_1 := "high"]
This solution has a bit of overhead from calling [.data.table twice, but it will still be much more efficient and safer than the ifelse solution.
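To illustrate how the two-step indexing approach treats a missing lag, here is a small sketch on hypothetical data (the table dt and its values are made up for illustration). It relies on data.table treating an NA in the row filter as not selected, so those rows keep the value copied in the first step.

```r
library(data.table)

# Hypothetical toy data with the lag column already computed
dt <- data.table(quality     = c("low", "low",  "high", "unknown", "low"),
                 quality_lag = c(NA,    "high", NA,     "low",     "low"))

# Step 1: start from a copy of the raw classification
dt[, quality_1 := quality]

# Step 2: overwrite only the rows where the condition is TRUE;
# rows where the condition evaluates to NA are not selected
dt[quality == "low" & quality_lag == "high", quality_1 := "high"]

dt$quality_1
```

Only the second row is reclassified to 'high'; the first row stays 'low' even though its lag is NA.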
Your problem is that ifelse(NA, 1, 2) returns NA, and when you do NA == 'high' the result is NA. For a 'low' row with a missing lag the test becomes TRUE & NA, which is NA, so ifelse returns NA; for a 'high' or 'unknown' row it becomes FALSE & NA, which is FALSE, so quality is kept. An easy fix is to represent NA as a string in your lag function. Here is a working version of your code:
require(data.table)

# lag takes a column and lags it by one period,
# padding with the string "NA" (not a missing value)
lag <- function(var) {
  # head(var, -1) drops the last element; unlike var[1:(length(var)-1)],
  # it returns an empty vector for a one-row group
  lagged <- c("NA",
              head(var, -1))
  return(lagged)
}

set.seed(120)
foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                  period = c(1:16),
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]
# "NA" == 'high' is FALSE rather than NA, so 'low' rows keep their value
foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]
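The NA behavior behind this can be reproduced in plain R. The sketch below uses hypothetical vectors and also shows one alternative that is not part of the answer above: testing the lag with %in%, which never returns NA, so a real NA can stay in the lag column instead of the string "NA".

```r
# Comparing anything with NA yields NA
NA == "high"            # NA

# `&` with NA: the result is only known when the other side is FALSE
TRUE  & NA              # NA    (a 'low' row with a missing lag)
FALSE & NA              # FALSE (a 'high' or 'unknown' row with a missing lag)

# ifelse() passes an NA test straight through to the result
ifelse(NA, 1, 2)        # NA

# Hypothetical vectors: %in% gives FALSE for a missing lag, never NA
quality     <- c("low", "low", "high")
quality_lag <- c(NA, "high", NA)
quality_1   <- ifelse(quality == "low" & quality_lag %in% "high", "high", quality)
quality_1               # "low" "high" "high"
```

Whether to use a string sentinel or %in% is a design choice: the sentinel keeps the original ifelse call unchanged, while %in% keeps genuine missing values in quality_lag.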