简体   繁体   中英

How to create a conditional dummy in R?

I have a dataframe of time series data with daily observations of temperatures. I need to create a dummy variable that counts each day that has temperature above a threshold of 5C. This would be easy in itself, but an additional condition exists: counting starts only after ten consecutive days above the threshold occurs. Here's an example dataframe:

df <- data.frame(date = seq(365), 
         temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))

I think I got it done, but with too many loops for my liking. This is what I did:

df$dummyUnconditional <- 0
df$dummyHead <- 0
df$dummyTail <- 0

for(i in 1:nrow(df)){
    if(df$temp[i] > 5){
        df$dummyUnconditional[i] <- 1
    }
}

for(i in 1:(nrow(df)-9)){
    if(sum(df$dummyUnconditional[i:(i+9)]) == 10){
        df$dummyHead[i] <- 1
    }
}

for(i in 9:nrow(df)){
    if(sum(df$dummyUnconditional[(i-9):i]) == 10){
        df$dummyTail[i] <- 1
    }
}

df$dummyConditional <- ifelse(df$dummyHead == 1 | df$dummyTail == 1, 1, 0)

Could anyone suggest simpler ways for doing this?

Here's a base R option using rle :

df$dummy <- with(rle(df$temp > 5), rep(as.integer(values & lengths >= 10), lengths))

Some explanation: The task is a classic use case for the run length encoding ( rle ) function, imo. We first check if the value of temp is greater than 5 (creating a logical vector) and apply rle on that vector resulting in:

> rle(df$temp > 5)
#Run Length Encoding
#  lengths: int [1:7] 66 1 1 225 2 1 69
#  values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...

Now we want to find those cases where the values is TRUE (ie temp is greater than 5) and where at the same time the lengths is greater than 10 (ie at least ten consecutive temp values are greater than 5). We do this by running:

values & lengths >= 10

And finally, since we want to return a vector of the same lengths as nrow(df) , we use rep(..., lengths) and as.integer in order to return 1/0 instead of TRUE / FALSE .

I think you could use a combination of a simple ifelse and the roll apply function in the zoo package to achieve what you are looking for. The final step just involves padding the result to account for the first N-1 days where there isnt enough information to fill the window.

library(zoo)

df <- data.frame(date = seq(365), 
                 temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))

df$above5 <- ifelse(df$temp > 5, 1, 0)
temp <- rollapply(df$above5, 10, sum)
df$conseq <- c(rep(0, 9),temp)

I would do this:

set.seed(42)
df <- data.frame(date = seq(365), 
                 temp = -30 + 0.65*seq(365) - 0.0018*seq(365)^2 + rnorm(365))
thr <- 5
df$dum <- 0

#find first 10 consecutive values above threshold
test1 <- filter(df$temp > thr, rep(1,10), sides = 1) == 10L
test1[1:9] <- FALSE
n <- which(cumsum(test1) == 1L)

#count days above threshold after that
df$dum[(n+1):nrow(df)] <- cumsum(df$temp[(n+1):nrow(df)] > thr)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM