Rolling percentile for conditional selections in R

I have a data.frame with daily maximum and minimum temperatures for 40 years, and I need to select all days whose maximum temperature is above the 90th percentile of maximum temperature and whose minimum temperature is above the 85th percentile of minimum temperature.

I was able to do that using percentiles computed over the whole record:

> head(df)
  YEAR MONTH DAY     Date MEAN  MAX  MIN
1 1965     1   1 1/1/1965   NA 27.0 17.0
2 1965     1   2 1/2/1965 24.0 28.0 20.7
3 1965     1   3 1/3/1965 19.9 23.7 16.2
4 1965     1   4 1/4/1965 18.0 23.4 12.0
5 1965     1   5 1/5/1965 19.7 24.0 14.0
6 1965     1   6 1/6/1965 18.6 24.0 13.0


library(data.table); setDT(df)  # the := syntax requires a data.table
df[, hotday := +(MAX >= quantile(MAX, .90, na.rm = TRUE, type = 6) & MIN >= quantile(MIN, .85, na.rm = TRUE, type = 6))
   ][, length := with(rle(hotday), rep(lengths, lengths))  # run length, so I can select consecutive days only
   ][hotday == 0, length := 0
   ][!!hotday, Highest_Mean := max(MEAN), rleid(length)][]  # highest mean temperature in each consecutive group

But I need to do the same thing using centered rolling percentiles with a 15-day window (i.e., for a given day, the 90th percentile of maximum temperature is the 90th percentile of the historical data for a 15-day window centered on that day).

By this I mean that the percentile should be calculated from the historical data of each calendar day, using a 15-day calendar window. There are 365 calendar days, so for day 118 I would use the historical data for days 111, 112, ..., 125. Since I have 40 years of data, the 15-day window yields a total sample size of 40 years × 15 days = 600 for each calendar day. The moving window is based on the calendar day, not on the time series.
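To make the window concrete, here is a minimal sketch in base R. The helper name `window_days()` is hypothetical, and wrapping around the year boundary (day 365 back to day 1) is an assumption; how the edges of the year should be handled is still open.

```r
# Hypothetical helper: calendar days in a centered window around `day`.
# Assumes a 365-day year and wraps around the year boundary.
window_days <- function(day, half_width = 7, n_days = 365) {
  ((day - half_width):(day + half_width) - 1) %% n_days + 1
}

window_days(118)               # days 111 to 125
window_days(3)                 # days 361 to 365, then 1 to 10
length(window_days(118)) * 40  # 600 values pooled over 40 years
```

Every calendar day gets 15 window days, so the pooled sample size is the same for all 365 days.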

Any thoughts, please?

What about something like this to select the rows you want?

Since you want a sliding window of 15 days centered on the day of interest, every window consists of the 7 preceding days + the day of interest + the 7 following days. Because of this constraint, the first 7 and the last 7 days (rows) of the dataset have no complete window; they are excluded by forcing them to FALSE with rep(FALSE, 7).

The code inside the sapply() call tests each day (starting from row 7 + 1 = 8) against its 15-day sliding window (as defined above) and checks whether the MAX temperature is at or above the 90th percentile of that window (test1). A similar test (test2) looks at the MIN temperature; as written, it checks whether MIN is at or below the 15th percentile of the window. If either of the two tests is TRUE, TRUE is returned (otherwise FALSE). You can easily adapt this to your needs, for example by testing MIN >= the 0.85 quantile and combining the tests with & instead of |.

The resulting logical vector, KEEP, holds TRUE/FALSE values that can be used to subset the initial data frame.

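set.seed(111)
# Toy data: 50 days of random MIN and MAX temperatures
df <- data.frame(MIN = sample(50:70, size = 50, replace = TRUE),
                 MAX = sample(70:90, size = 50, replace = TRUE))
head(df)

# The first and last 7 rows have no complete 15-day window, so they are FALSE
KEEP <- c(rep(FALSE, 7),
          sapply(8:(length(df$MAX) - 7), function(i) {
            # test1: MAX at or above the 90th percentile of the window centered on row i
            test1 <- df$MAX[i] >= as.numeric(quantile(df$MAX[(i-7):(i+7)], 0.9, na.rm = TRUE))
            # test2: MIN at or below the 15th percentile of the same window
            test2 <- df$MIN[i] <= as.numeric(quantile(df$MIN[(i-7):(i+7)], 0.15, na.rm = TRUE))
            test1 | test2
          }),
          rep(FALSE, 7))
head(KEEP)
df <- df[KEEP, ]  # keep only the selected rows
df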

This should return:

   MIN MAX
10  51  86
13  51  73
14  50  75
15  53  89
22  55  83
28  55  90
31  51  72
32  60  88
37  52  84
42  56  87
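
Note that the window above slides along the rows of the time series. If the percentiles should instead be pooled by calendar day across all years, as the question describes (40 years × 15 days = 600 values per calendar day), a minimal sketch could look like the following. The `DOY` (day of year) column, the simulated temperatures, the 365-day year without leap days, and the wrap-around at the year boundary are all assumptions for illustration and are not part of the original post.

```r
set.seed(1)
# Assumed toy data: 40 years x 365 days (leap days ignored for simplicity)
n_years <- 40
df <- data.frame(
  YEAR = rep(1965:(1965 + n_years - 1), each = 365),
  DOY  = rep(1:365, times = n_years)   # hypothetical day-of-year column
)
df$MAX <- 25 + 8 * sin(2 * pi * (df$DOY - 100) / 365) + rnorm(nrow(df), sd = 3)
df$MIN <- df$MAX - 8 + rnorm(nrow(df), sd = 2)

# For each calendar day, pool all years over a centered 15-day window
# (wrapping around the year end) and compute the two thresholds.
thresholds <- t(sapply(1:365, function(d) {
  win    <- ((d - 7):(d + 7) - 1) %% 365 + 1   # 15 calendar days around d
  in_win <- df$DOY %in% win                    # 600 rows: 40 years x 15 days
  c(MAX90 = as.numeric(quantile(df$MAX[in_win], 0.90, na.rm = TRUE, type = 6)),
    MIN85 = as.numeric(quantile(df$MIN[in_win], 0.85, na.rm = TRUE, type = 6)))
}))

# Look up each row's thresholds by its day of year and flag hot days
df$hotday <- as.integer(df$MAX >= thresholds[df$DOY, "MAX90"] &
                        df$MIN >= thresholds[df$DOY, "MIN85"])
```

The selected days are then `df[df$hotday == 1, ]`, and the run-length step from the question can be applied to `hotday` unchanged. With real data, the day of year would have to be derived from the Date column, and leap days would need a decision of their own.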
