
Extract time against Max and Min values in r

For the table below, I would like to extract the time against the peak and trough of the values. The dataset is analogous to a groundwater level, which is expected to peak coinciding with a rainfall event and gradually drop down until the next one. Here, I'd like to extract records 1, 5 and 9 as Min Time (trough) and records 2 and 7 as Max Time (peak). The peaks and troughs are defined as shown in the snapshot, where each colour is expected to appear as an output. Where the same value repeats in 'CS', I'm hoping to average the lead of three records for a peak, and the lag of the previous three records for a trough.

structure(list(TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", 
"25/06/2021 08:00", "25/06/2021 12:00", "25/06/2021 16:00", "25/06/2021 20:00", 
"26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00"), 
    CS = c(70L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 60L)), row.names = c(NA, 
9L), class = "data.frame")
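Note that TIMESTAMP is a character column; before any time arithmetic it needs to be parsed. A minimal sketch with lubridate, assuming the dput output above is assigned to mydata and the format is day/month/year as shown:

```r
library(lubridate)

mydata$TIMESTAMP <- dmy_hm(mydata$TIMESTAMP)  # "25/06/2021 04:00" -> POSIXct
```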

From a read of other posts, I started developing something like the code below using lubridate and tidyr; however, I guess I'm lost and realise it's time for me to get some help. Any help highly appreciated.

library(lubridate)
library(tidyr)
library(dplyr)  # needed for %>%, group_by(), filter(), etc.
d <- mydata %>%
  gather("CS","Temp",-TIMESTAMP) %>%
  group_by(Date = date(TIMESTAMP), HoD = hour(TIMESTAMP)) %>%
  mutate_at(.vars = "Temp", .funs = list(Min = min, Max = max)) %>%
  filter(Temp == Min | Temp == Max) %>%
  arrange(CS, TIMESTAMP) %>%
  distinct(Temp, .keep_all = T) %>%
  mutate(MinMax = ifelse(Temp == Min, "MinTime", "MaxTime")) %>%
  spread("MinMax", "TIMESTAMP")

Expected Output:

  Min_Time          CS_Min  Max_Time          CS_Max
1 25/06/2021 00:00      70  25/06/2021 04:00     138
2 25/06/2021 16:00      80  25/06/2021 04:00     138
3 25/06/2021 16:00      80  26/06/2021 00:00     150
4 27/06/2021 08:00      60  NA                    NA


It does not have to be hard to find single peaks and troughs, but a complicating factor is peaks and troughs that are more than one observation wide. Therefore I have added one such instance to your example data:

example input

df <- data.frame(CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
                 TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00", "25/06/2021 09:00", "25/06/2021 12:00",
                               "25/06/2021 16:00", "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00") )

> df
    CS        TIMESTAMP
1   70 25/06/2021 00:00
2  138 25/06/2021 04:00
3  138 25/06/2021 08:00
4  120 25/06/2021 09:00
5  100 25/06/2021 12:00
6   80 25/06/2021 16:00
7  110 25/06/2021 20:00
8  150 26/06/2021 00:00
9  100 27/06/2021 04:00
10  70 27/06/2021 08:00

step 1: identify clusters of consecutive row-ids that have the same CS

We will address the rows by their row id, i.e. the number in front of each row in the output of df above.

With data.table::rleid() we can quickly cluster the row ids of consecutive rows with the same CS value. We then create a copy of df in which we remove those 'duplicate' rows (df_unique):

library('data.table')
library('dplyr')
    
rleids <- rleid(df$CS)        # generate run-length type cluster ids

df_unique <- df %>%                       # create temporary df with only first of each cluster:
  mutate(org_rowid = row_number()) %>%    # save the original row number
  filter(CS != lag(CS) | is.na(lag(CS)))  # filter out rows that have identical CS as previous row

Results:

> rleids
 [1] 1 2 2 3 4 5 6 7 8 9

> df_unique
   CS        TIMESTAMP org_rowid
1  70 25/06/2021 00:00         1
2 138 25/06/2021 04:00         2
3 120 25/06/2021 09:00         4
4 100 25/06/2021 12:00         5
5  80 25/06/2021 16:00         6
6 110 25/06/2021 20:00         7
7 150 26/06/2021 00:00         8
8 100 27/06/2021 04:00         9
9  70 27/06/2021 08:00        10

step 2: identify (clusters of) rows higher / lower than the values around them

In the data frame with 'duplicate' values removed, we can easily find the peaks and troughs.

I've added | is.na( ... ) to also include the first and last rows as either a peak or trough, even though we don't know what lies beyond them.

# find (row number of) peaks and troughs in df_unique
peaks <- df_unique %>%
  filter( (CS > lead(CS) | is.na(lead(CS))) & (CS > lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

troughs <- df_unique %>%
  filter( (CS < lead(CS) | is.na(lead(CS))) & (CS < lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

Results:

> peaks
[1] 2 8

> troughs
[1]  1  6 10

step 3: add row ids of 'duplicate' rows

As the final step, we will use the data in rleids to add the row ids of all rows in each cluster. Since we can't preserve the clusters in a single vector, we will create two lists.

# add row numbers in the same cluster
peaks <- lapply(peaks, function(x) which(rleids == rleids[x]))
troughs <- lapply(troughs, function(x) which(rleids == rleids[x]))

Results:

> peaks
[[1]]
[1] 2 3

[[2]]
[1] 8

> troughs
[[1]]
[1] 1

[[2]]
[1] 6

[[3]]
[1] 10

optional step 4: cleanup

# cleanup
rm(df_unique, rleids)

complete code

library('dplyr')
library('data.table')

df <- data.frame(CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
                 TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00", "25/06/2021 09:00", "25/06/2021 12:00",
                               "25/06/2021 16:00", "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00") )

# step 1: identify clusters of consecutive row-ids that have the same CS

rleids <- rleid(df$CS)        # generate run-length type cluster ids

df_unique <- df %>%                       # create temporary df with only first of each cluster:
  mutate(org_rowid = row_number()) %>%    # save the original row number
  filter(CS != lag(CS) | is.na(lag(CS)))  # filter out rows that have identical CS as previous row
  
# step 2: identify (clusters of) rows higher / lower than the values around them

peaks <- df_unique %>%
  filter( (CS > lead(CS) | is.na(lead(CS))) & (CS > lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

troughs <- df_unique %>%
  filter( (CS < lead(CS) | is.na(lead(CS))) & (CS < lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

# step 3: add row numbers in the same cluster

peaks <- lapply(peaks, function(x) which(rleids == rleids[x]))
troughs <- lapply(troughs, function(x) which(rleids == rleids[x]))

# step 4: cleanup

rm(df_unique, rleids)

Results:

> peaks
[[1]]
[1] 2 3

[[2]]
[1] 8

> troughs
[[1]]
[1] 1

[[2]]
[1] 6

[[3]]
[1] 10
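To get from these row-id lists back to the timestamps you asked for, one possible follow-up (not part of the steps above, and using a simple mean over each cluster rather than your exact three-record averaging rule) is to parse the timestamps and average them per cluster, continuing from the df, peaks and troughs objects defined above:

```r
# parse the character timestamps (day/month/year format as in the example data)
ts <- as.POSIXct(df$TIMESTAMP, format = "%d/%m/%Y %H:%M", tz = "UTC")

# one timestamp per peak / trough cluster; multi-row clusters (e.g. rows 2 and 3)
# are collapsed to the mean of their timestamps
peak_times   <- sapply(peaks,   function(ids) format(mean(ts[ids]), "%d/%m/%Y %H:%M"))
trough_times <- sapply(troughs, function(ids) format(mean(ts[ids]), "%d/%m/%Y %H:%M"))
```

For the example data this collapses the duplicated 138-peak (04:00 and 08:00) to 06:00; if you need the lead/lag-of-three averaging from the question instead, replace mean(ts[ids]) with your own window over the neighbouring rows.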
