Ifelse behavior within data.table (R)

I have a data.table full of consumer products. I've classified each product as 'low', 'high', or 'unknown' quality. The data are time series, and I'm interested in smoothing out some seasonality in the data. If a product's raw classification (the classification produced by the algorithm I used to determine quality) is 'low' in period X, but its raw classification was 'high' in period X-1, I reclassify that product as 'high' quality for period X. This is done within each product group.

To accomplish this, I've got something like the following:

require(data.table)

# lag takes a column and lags it by one period,
# padding with NA

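lag <- function(var) {
    # Prepend NA, then keep the first length(var) elements.
    # This also works for length-1 groups, where
    # var[1:(length(var)-1)] would index 1:0.
    lagged <- c(NA, var)[seq_along(var)]
    return(lagged)
}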

set.seed(120)

foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                  period = c(1:16),
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]

foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]

Taking a look at foo:

    group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA        NA
 3:     C      3    high          NA      high
 4:     D      4     low          NA        NA
 5:     B      5 unknown         low   unknown
 6:     C      6    high        high      high
 7:     D      7     low         low       low
 8:     B      8 unknown     unknown   unknown
 9:     C      9    high        high      high
10:     D     10 unknown         low   unknown
11:     B     11 unknown     unknown   unknown
12:     C     12     low        high      high
13:     D     13 unknown     unknown   unknown
14:     B     14    high     unknown      high
15:     C     15    high         low      high
16:     D     16 unknown     unknown   unknown

So, quality_1 is mostly what I want. If period X is 'low' and period X-1 is 'high', the reclassification to 'high' occurs, and everything else is carried over from quality unchanged. However, when quality_lag is NA, 'low' gets reclassified to NA in quality_1. This is not an issue with 'high' or 'unknown'.

That is, the first four rows of foo should look like this:

   group period quality quality_lag quality_1
 1:     A      1 unknown          NA   unknown
 2:     B      2     low          NA       low
 3:     C      3    high          NA      high
 4:     D      4     low          NA       low

Any thoughts on what is causing this?

For starters, the development version on GitHub already has an efficient lag function called shift, which can be used as either lag or lead (and has some additional functionality too, see ?shift).

Also take a look here, as there are a bunch of other new functions now present in v >= 1.9.5.

So under v >= 1.9.5 we could simply do

foo[, quality_lag := shift(quality), by = group]

Though even under v < 1.9.5 you could make use of .N instead of creating this function, in the following manner

foo[, quality_lag2 := c(NA, quality[-.N]), by = group]
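As a rough check that the two approaches agree, here is a minimal sketch on made-up data (the table dt and its columns are hypothetical, not from the question). It assumes a data.table version that provides shift, and it includes a single-row group to show that the .N version pads with NA there as well.

```r
library(data.table)

# Hypothetical toy data: group "A" has a single row
dt <- data.table(group = c("A", "B", "B", "C", "C", "C"),
                 x     = c("u", "l", "h", "h", "l", "u"))

# Lag by one period within each group using shift()
dt[, lag_shift := shift(x), by = group]

# Same lag built by hand: drop the last element (.N) and pad with NA
dt[, lag_manual := c(NA, x[-.N]), by = group]

# Both columns should hold the same values
identical(dt$lag_shift, dt$lag_manual)
```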

Regarding your second question, I would recommend avoiding ifelse altogether, for many reasons specified here

One possible alternative would be to just use simple indexing, as in

foo[, quality_1 := quality][quality == 'low' & quality_lag == 'high', quality_1 := "high"]

This solution has a bit of overhead from calling [.data.table twice, but it will still be much more efficient/safe than the ifelse solution.
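To see why the indexing approach sidesteps the NA problem, here is a small sketch on hypothetical data (dt is made up for illustration). It relies on data.table treating NA values in a logical i as FALSE, so rows with a missing lag are simply not selected and keep the value copied from quality.

```r
library(data.table)

# Hypothetical toy data: rows 2 and 3 have a missing lag
dt <- data.table(quality     = c("low", "low", "high"),
                 quality_lag = c("high", NA, NA))

# Start from a copy of quality ...
dt[, quality_1 := quality]

# ... then overwrite only the rows where the condition is TRUE.
# Rows where the condition is NA are not selected.
dt[quality == "low" & quality_lag == "high", quality_1 := "high"]

dt$quality_1
```

Only the first row is switched to 'high'; the 'low' row with a missing lag stays 'low' instead of becoming NA.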

Your problem is that ifelse(NA, 1, 2) returns NA, and when you evaluate NA == 'high' the result is NA. In the affected rows quality == 'low' is TRUE, and TRUE & NA is NA, whereas for 'high' or 'unknown' the test is FALSE & NA, which is FALSE. An easy fix is to represent NA as a string in your lag function. Here is a working version of your code:

require(data.table)

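# lag takes a column and lags it by one period,
# padding with the string "NA" instead of a real NA

lag <- function(var) {
    # Prepend "NA", then keep the first length(var) elements.
    # This also works for length-1 groups, where
    # var[1:(length(var)-1)] would index 1:0.
    lagged <- c("NA", var)[seq_along(var)]
    return(lagged)
}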

set.seed(120)

foo <- data.table(group = c('A', rep(c('B', 'C', 'D'), 5)),
                  period = c(1:16),
                  quality = c('unknown', sample(c('high', 'low', 'unknown'), 15, replace = TRUE)))

foo[, quality_lag := lag(quality), by = group]

foo[, quality_1 := ifelse(quality == 'low' & quality_lag == 'high',
                          'high',
                          quality)]
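The underlying behavior can be reproduced in base R without data.table. The sketch below uses made-up vectors (not the foo table) to show how the condition evaluates row by row, plus one possible guard that keeps real NA values in the lag column and treats a missing condition as FALSE. The guard is an alternative for illustration, not part of the code above.

```r
# Hypothetical toy vectors mirroring the three cases in the question
quality     <- c("low", "high", "low")
quality_lag <- c(NA, NA, "high")

# TRUE & NA is NA, FALSE & NA is FALSE, TRUE & TRUE is TRUE
cond <- quality == "low" & quality_lag == "high"

# ifelse() propagates the NA in the test into the result
naive <- ifelse(cond, "high", quality)

# One possible guard: only act where the condition is known to be TRUE
guarded <- ifelse(!is.na(cond) & cond, "high", quality)

print(cond)
print(naive)
print(guarded)
```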
