How to subset consecutive rows if they meet a condition

I am using R to analyze a number of time series (1951-2013) containing daily values of Max and Min temperatures. The data has the following structure:

YEAR MONTH  DAY     MAX    MIN
1985     1    1    22.8    9.4
1985     1    2    28.6   11.7
1985     1    3    24.7   12.2
1985     1    4    17.2    8.0
1985     1    5    17.9    7.6
1985     1    6    17.7    8.1

I need to find the frequency of heat waves based on this definition: a period of three or more consecutive days with a daily maximum and minimum temperature exceeding the 90th percentile of the maximum and minimum temperatures for all days in the studied period.

Basically, I want to subset those consecutive days (three or more) when the Max and Min temp exceed a threshold value. The output would be something like this:

YEAR MONTH   DAY     MAX     MIN
1989     7    18    45.0    23.5
1989     7    19    44.2    26.1
1989     7    20    44.7    24.4
1989     7    21    44.6    29.5
1989     7    24    44.4    31.6
1989     7    25    44.2    26.7
1989     7    26    44.5    25.0
1989     7    28    44.8    26.0
1989     7    29    44.8    24.6
1989     8    19    45.0    24.3
1989     8    20    44.8    26.0
1989     8    21    44.4    24.0
1989     8    22    45.2    25.0

I have tried the following to subset my full dataset to just the days that exceed the 90th percentile temperature:

HW<- subset(Mydata, Mydata$MAX >= (quantile(Mydata$MAX,.9)) &
                    Mydata$MIN >= (quantile(Mydata$MIN,.9)))

However, I am stuck on how to subset only the consecutive days that meet the condition.

An approach with data.table which is slightly different from @jlhoward's approach (using the same data):

library(data.table)

setDT(df)
df[, hotday := +(MAX>=44.5 & MIN>=24.5)
   ][, hw.length := with(rle(hotday), rep(lengths,lengths))
     ][hotday == 0, hw.length := 0]

This produces a data.table with a heat wave length variable (hw.length) instead of a TRUE/FALSE variable for a specific heat wave length:

> df
    YEAR MONTH DAY  MAX  MIN hotday hw.length
 1: 1989     7  18 45.0 23.5      0         0
 2: 1989     7  19 44.2 26.1      0         0
 3: 1989     7  20 44.7 24.4      0         0
 4: 1989     7  21 44.6 29.5      1         1
 5: 1989     7  22 44.4 31.6      0         0
 6: 1989     7  23 44.2 26.7      0         0
 7: 1989     7  24 44.5 25.0      1         3
 8: 1989     7  25 44.8 26.0      1         3
 9: 1989     7  26 44.8 24.6      1         3
10: 1989     7  27 45.0 24.3      0         0
11: 1989     7  28 44.8 26.0      1         1
12: 1989     7  29 44.4 24.0      0         0
13: 1989     7  30 45.2 25.0      1         1

I may be missing something here, but I don't see the point of subsetting beforehand. If you have data for every day, in chronological order, you can use run length encoding (see the docs on the rle(...) function).
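
To see what rle() returns, here is a minimal sketch on a made-up logical vector (the name hot and its values are assumptions for illustration, not data from the question). Expanding the run lengths with rep(lengths, lengths) gives, for every element, the length of the run it belongs to:

```r
# hypothetical toy vector: TRUE marks a day that meets the condition
hot <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)

runs <- rle(hot)
runs$values   # FALSE  TRUE FALSE  TRUE
runs$lengths  # 1 3 1 1

# length of the run each element belongs to
run_len <- rep(runs$lengths, runs$lengths)
run_len       # 1 3 3 3 1 1

# TRUE only for elements inside a TRUE run of length 3 or more
in_long_run <- hot & run_len >= 3
in_long_run   # FALSE  TRUE  TRUE  TRUE FALSE FALSE
```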

In the example below we create an artificial data set and define "heat wave" as MAX >= 44.5 and MIN >= 24.5. Then:

# example data set
df <- data.frame(YEAR=1989, MONTH=7, DAY=18:30, 
                 MAX=c(45, 44.2, 44.7, 44.6, 44.4, 44.2, 44.5, 44.8, 44.8, 45, 44.8, 44.4, 45.2),
                 MIN=c(23.5, 26.1, 24.4, 29.5, 31.6, 26.7, 25, 26, 24.6, 24.3, 26, 24, 25))

r <- with(with(df, rle(MAX>=44.5 & MIN>=24.5)),rep(lengths,lengths))
df$heat.wave <- with(df,MAX>=44.5&MIN>=24.5) & (r>2)
df
#    YEAR MONTH DAY  MAX  MIN heat.wave
# 1  1989     7  18 45.0 23.5     FALSE
# 2  1989     7  19 44.2 26.1     FALSE
# 3  1989     7  20 44.7 24.4     FALSE
# 4  1989     7  21 44.6 29.5     FALSE
# 5  1989     7  22 44.4 31.6     FALSE
# 6  1989     7  23 44.2 26.7     FALSE
# 7  1989     7  24 44.5 25.0      TRUE
# 8  1989     7  25 44.8 26.0      TRUE
# 9  1989     7  26 44.8 24.6      TRUE
# 10 1989     7  27 45.0 24.3     FALSE
# 11 1989     7  28 44.8 26.0     FALSE
# 12 1989     7  29 44.4 24.0     FALSE
# 13 1989     7  30 45.2 25.0     FALSE

This creates a column, heat.wave, which is TRUE if there was a heat wave on that day. If you need to extract only the heat wave days, use

df[df$heat.wave,]
#   YEAR MONTH DAY  MAX  MIN heat.wave
# 7 1989     7  24 44.5 25.0      TRUE
# 8 1989     7  25 44.8 26.0      TRUE
# 9 1989     7  26 44.8 24.6      TRUE
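
Since the question asks for the frequency of heat waves, the same run length encoding can also be used to count them. This is a minimal sketch, assuming the artificial df from above (recreated here so the snippet runs on its own) and the same illustrative thresholds; the names runs, is_wave, n_waves and wave_lengths are made up for this example:

```r
df <- data.frame(YEAR=1989, MONTH=7, DAY=18:30,
                 MAX=c(45, 44.2, 44.7, 44.6, 44.4, 44.2, 44.5, 44.8, 44.8, 45, 44.8, 44.4, 45.2),
                 MIN=c(23.5, 26.1, 24.4, 29.5, 31.6, 26.7, 25, 26, 24.6, 24.3, 26, 24, 25))

runs <- with(df, rle(MAX >= 44.5 & MIN >= 24.5))

# a heat wave is a run of TRUE lasting at least 3 days
is_wave <- runs$values & runs$lengths >= 3
n_waves <- sum(is_wave)                # number of heat waves
wave_lengths <- runs$lengths[is_wave]  # duration of each heat wave, in days
n_waves       # 1
wave_lengths  # 3
```

On the real series the two thresholds would come from quantile(MAX, .9) and quantile(MIN, .9), and the count is only meaningful if there is one row per calendar day in chronological order.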

Your question really boils down to finding groupings of 3+ consecutive days in your subsetted dataset, removing all remaining data.

Let's consider an example where we would want to keep some rows and remove others:

dat <- data.frame(year = 1989, month=c(6, 7, 7, 7, 7, 7, 8, 8, 8, 10, 10), day=c(12, 11, 12, 13, 14, 21, 5, 6, 7, 12, 13))
dat
#    year month day
# 1  1989     6  12
# 2  1989     7  11
# 3  1989     7  12
# 4  1989     7  13
# 5  1989     7  14
# 6  1989     7  21
# 7  1989     8   5
# 8  1989     8   6
# 9  1989     8   7
# 10 1989    10  12
# 11 1989    10  13

I've excluded the temperature data, because I'm assuming we've already subsetted to just the days that exceed the 90th percentile using the code from your question.

In this dataset there is a 4-day heat wave in July and a 3-day heat wave in August. The first step would be to convert the data to date objects and compute the number of days between consecutive observations (I assume the data is already ordered by day here):

dates <- as.Date(paste(dat$year, dat$month, dat$day, sep="-"))
(dd <- as.numeric(difftime(tail(dates, -1), head(dates, -1), units="days")))
# [1] 29  1  1  1  7 15  1  1 66  1

We're close, because now we can see the time periods where there were multiple date gaps of 1 day; these are the ones we want to grab. We can use the rle function to analyze runs of the number 1, keeping only the runs of length 2 or more:

(valid.gap <- with(rle(dd == 1), rep(values & lengths >= 2, lengths)))
# [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

Finally, we can subset the dataset to just the days that were on either side of a 1-day date gap that is part of a heat wave:

dat[c(FALSE, valid.gap) | c(valid.gap, FALSE),]
#   year month day
# 2 1989     7  11
# 3 1989     7  12
# 4 1989     7  13
# 5 1989     7  14
# 7 1989     8   5
# 8 1989     8   6
# 9 1989     8   7
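
The same grouping can also be sketched with cumsum() instead of rle(): start a new group whenever the gap to the previous date is not exactly 1 day, then keep the groups with at least 3 rows. This is an alternative formulation rather than the method above, and the names grp, grp_size and waves are made up for illustration:

```r
dat <- data.frame(year = 1989,
                  month = c(6, 7, 7, 7, 7, 7, 8, 8, 8, 10, 10),
                  day = c(12, 11, 12, 13, 14, 21, 5, 6, 7, 12, 13))
dates <- as.Date(paste(dat$year, dat$month, dat$day, sep = "-"))

# a new group starts at the first row and after every gap other than 1 day
grp <- cumsum(c(TRUE, as.numeric(diff(dates)) != 1))

# number of rows in the group each row belongs to
grp_size <- ave(grp, grp, FUN = length)

# keep only rows that sit in a group of 3 or more consecutive days
waves <- dat[grp_size >= 3, ]
waves
```

This should select the same rows (2 to 5 and 7 to 9) as the rle-based subset.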

A simple approach, not fully vectorized:

# play data
year <- c("1960")
month <- c(rep(1,30), rep(2,30), rep(3,30))
day <- rep(1:30,3)
maxT <- round(runif(90, 20, 22),1)
minT <- round(runif(90, 10, 12),1)

df <- data.frame(year, month, day, maxT, minT)

# target and tricky data...
df[1:3, 4] <- 30
df[1:4, 5] <- 14
df[10:13, 4] <- 30
df[10:11, 5] <- 14

# limits
df$maxTope <- df$maxT - quantile(df$maxT,0.9)
df$minTope <- df$minT - quantile(df$minT,0.9)

# define heat day
df$heat <- ifelse(df$maxTope > 0 & df$minTope >0, 1, 0)

# count consecutive heat days (running count, reset to 0 on a non-heat day)
df$count <- 0
df$count[1] <- df$heat[1]
for(i in 2:nrow(df)){ 
    df$count[i] <- ifelse(df$heat[i] == 1, df$count[i-1]+1, 0)
}

# select day 3 onward of each heat wave ($count is the running day number within the wave)
df[which(df$count >= 3),]
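
For comparison, the running count built by the loop can be sketched without a loop by combining rle() with sequence(). The heat vector below is a made-up stand-in for df$heat, not output from the random data above:

```r
# hypothetical 0/1 indicator standing in for df$heat
heat <- c(1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)

# sequence() numbers the positions within each run (1, 2, 3, ...);
# multiplying by heat zeroes out the runs of non-heat days
count <- sequence(rle(heat)$lengths) * heat
count
# 1 2 3 0 1 2 0 0 1 2 3 4

# positions of the third and later days of each heat wave
which(count >= 3)
# 3 11 12
```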

Here's a quick little solution:

is_High_Temp <- (Mydata$MAX >= quantile(Mydata$MAX, .9) &
                 Mydata$MIN >= quantile(Mydata$MIN, .9))
# TRUE wherever the value differs from the previous day
start_of_a_series <- c(TRUE, is_High_Temp[-1] != is_High_Temp[-length(is_High_Temp)]) # this is the tricky part
series_number <- cumsum(start_of_a_series)
series_length <- ave(series_number, series_number, FUN = length)
is_heat_wave  <- series_length >= 3 & is_High_Temp
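
To make the intermediate vectors concrete, here is a minimal sketch of the same steps on a made-up logical vector (the values are assumptions for illustration, not real temperatures):

```r
# hypothetical indicator: TRUE marks a day above both thresholds
is_High_Temp <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

# TRUE wherever the value differs from the previous day (the first day always starts a series)
start_of_a_series <- c(TRUE, is_High_Temp[-1] != is_High_Temp[-length(is_High_Temp)])
series_number <- cumsum(start_of_a_series)                        # 1 2 2 2 3 4 4
series_length <- ave(series_number, series_number, FUN = length)  # 1 3 3 3 1 2 2
is_heat_wave  <- series_length >= 3 & is_High_Temp
is_heat_wave
# FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```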

A solution with dplyr, also using rle():

library(dplyr)

cond <- expr(MAX >= 44.5 & MIN >= 24.5)

df %>% 
  mutate(heatwave = 
           rep(rle(!!cond)$values & rle(!!cond)$lengths >= 3, 
               rle(!!cond)$lengths)) %>%
  filter(heatwave)

#>   YEAR MONTH DAY  MAX  MIN heatwave
#> 1 1989     7  24 44.5 25.0     TRUE
#> 2 1989     7  25 44.8 26.0     TRUE
#> 3 1989     7  26 44.8 24.6     TRUE

Created on 2020-05-16 by the reprex package (v0.3.0)

Data

#devtools::install_github("alistaire47/read.so")
df <- read.so::read.so("YEAR MONTH   DAY     MAX     MIN
1989     7    18    45.0    23.5
1989     7    19    44.2    26.1
1989     7    20    44.7    24.4
1989     7    21    44.6    29.5
1989     7    24    44.4    31.6
1989     7    25    44.2    26.7
1989     7    26    44.5    25.0
1989     7    28    44.8    26.0
1989     7    29    44.8    24.6
1989     8    19    45.0    24.3
1989     8    20    44.8    26.0
1989     8    21    44.4    24.0
1989     8    22    45.2    25.0")
