
Extract time against Max and Min values in R

For the table below, I would like to extract the times at the peaks and troughs of the values. The dataset is analogous to a groundwater level, which is expected to peak coinciding with a rainfall event and then gradually drop until the next rainfall event. Here, I'd like to extract records 1, 5 and 9 as Min Time (troughs) and records 2 and 7 as Max Time (peaks). The peaks and troughs are defined as shown in the snapshot, where each colour is expected to appear in the output. When consecutive records have the same 'CS' value, I'd like to average the three leading records for a peak, and the three lagging (previous) records for a trough.

structure(list(TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", 
"25/06/2021 08:00", "25/06/2021 12:00", "25/06/2021 16:00", "25/06/2021 20:00", 
"26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00"), 
    CS = c(70L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 60L)), row.names = c(NA, 
9L), class = "data.frame")

From reading other posts, I started developing the code below using lubridate and tidyr; however, I'm lost and realise it's time for me to get some help. Any help is highly appreciated.

library(dplyr)      # needed for %>% and the verbs below
library(lubridate)
library(tidyr)
d <- mydata %>%     # mydata: the data frame from the dput() above
  gather("CS","Temp",-TIMESTAMP) %>%
  group_by(Date = date(TIMESTAMP), HoD = hour(TIMESTAMP)) %>%
  mutate_at(.vars = "Temp", .funs = list(Min = min, Max = max)) %>%
  filter(Temp == Min | Temp == Max) %>%
  arrange(CS, TIMESTAMP) %>%
  distinct(Temp, .keep_all = T) %>%
  mutate(MinMax = ifelse(Temp == Min, "MinTime", "MaxTime")) %>%
  spread("MinMax", "TIMESTAMP")

Expected Output:

  Min_Time          CS_Min  Max_Time          CS_Max
1 25/06/2021 00:00      70  25/06/2021 04:00     138
2 25/06/2021 16:00      80  25/06/2021 04:00     138
3 25/06/2021 16:00      80  26/06/2021 00:00     150
4 27/06/2021 08:00      60  NA                    NA

(image: snapshot illustrating the peaks and troughs)

Finding single peaks and troughs is not hard, but a complicating factor is peaks and troughs that are more than one observation wide. I have therefore added one such instance to your example data:

example input

df <- data.frame(CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
                 TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00", "25/06/2021 09:00", "25/06/2021 12:00",
                               "25/06/2021 16:00", "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00") )

> df
    CS        TIMESTAMP
1   70 25/06/2021 00:00
2  138 25/06/2021 04:00
3  138 25/06/2021 08:00
4  120 25/06/2021 09:00
5  100 25/06/2021 12:00
6   80 25/06/2021 16:00
7  110 25/06/2021 20:00
8  150 26/06/2021 00:00
9  100 27/06/2021 04:00
10  70 27/06/2021 08:00

step 1: identify clusters of consecutive row-ids that have the same CS

We will address the rows by their row id, i.e. the number in front of each row in the output of df above.

With data.table::rleid() we can quickly cluster the row ids of consecutive rows with the same CS value. We then create a copy of df with those 'duplicate' rows removed (df_unique):

library('data.table')
library('dplyr')
    
rleids <- rleid(df$CS)        # generate run-length type cluster ids

df_unique <- df %>%                       # create temporary df with only first of each cluster:
  mutate(org_rowid = row_number()) %>%    # save the original row number
  filter(CS != lag(CS) | is.na(lag(CS)))  # filter out rows that have identical CS as previous row

Results:

> rleids
 [1] 1 2 2 3 4 5 6 7 8 9

> df_unique
   CS        TIMESTAMP org_rowid
1  70 25/06/2021 00:00         1
2 138 25/06/2021 04:00         2
3 120 25/06/2021 09:00         4
4 100 25/06/2021 12:00         5
5  80 25/06/2021 16:00         6
6 110 25/06/2021 20:00         7
7 150 26/06/2021 00:00         8
8 100 27/06/2021 04:00         9
9  70 27/06/2021 08:00        10
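
If you prefer to avoid the data.table dependency, the same run-length ids can be built with base R's rle(). A small sketch (rleids_base is an assumed name, not used elsewhere in this answer):

```r
# base R equivalent of data.table::rleid() for this column
cs <- c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L)  # df$CS
r  <- rle(cs)                                  # lengths/values of runs of equal CS
rleids_base <- rep(seq_along(r$lengths), r$lengths)
rleids_base
# [1] 1 2 2 3 4 5 6 7 8 9
```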

step 2: identify (clusters of) rows higher / lower than the values around them

In the data frame with 'duplicate' values removed, we can easily find the peaks and troughs.

I've added | is.na( ... ) | is.na( ... ) to also include the first and last rows as either a peak or trough, even though we don't know what lies beyond them.

# find (row number of) peaks and troughs in df_unique
peaks <- df_unique %>%
  filter( (CS > lead(CS) | is.na(lead(CS))) & (CS > lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

troughs <- df_unique %>%
  filter( (CS < lead(CS) | is.na(lead(CS))) & (CS < lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

Results:

> peaks
[1] 2 8

> troughs
[1]  1  6 10

step 3: add row ids of 'duplicate' rows

As the final step, we will use the data in rleids to add row ids for all rows in each cluster. Since we can't preserve clusters in a vector, we will create two lists.

# add row numbers in the same cluster
peaks <- lapply(peaks, function(x) which(rleids == rleids[x]))
troughs <- lapply(troughs, function(x) which(rleids == rleids[x]))

Results:

> peaks
[[1]]
[1] 2 3

[[2]]
[1] 8

> troughs
[[1]]
[1] 1

[[2]]
[1] 6

[[3]]
[1] 10
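
Since each list entry is a vector of row ids into df, mapping a cluster back to its rows is just indexing. A small sketch (inputs are repeated so it runs standalone; peak_summary is an assumed name, and taking the first row of a cluster as its representative is one possible choice):

```r
# sketch: summarise each peak cluster by its first row's timestamp and CS value
df <- data.frame(
  CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
  TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00",
                "25/06/2021 09:00", "25/06/2021 12:00", "25/06/2021 16:00",
                "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00",
                "27/06/2021 08:00"))
peaks <- list(c(2L, 3L), 8L)   # row-id clusters from step 3

peak_summary <- data.frame(
  Max_Time = sapply(peaks, function(ids) df$TIMESTAMP[ids[1]]),
  CS_Max   = sapply(peaks, function(ids) df$CS[ids[1]]))
peak_summary
#           Max_Time CS_Max
# 1 25/06/2021 04:00    138
# 2 26/06/2021 00:00    150
```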

optional step 4: cleanup

# cleanup
rm(df_unique, rleids)

complete code

library('dplyr')
library('data.table')

df <- data.frame(CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
                 TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00", "25/06/2021 09:00", "25/06/2021 12:00",
                               "25/06/2021 16:00", "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00", "27/06/2021 08:00") )

# step 1: identify clusters of consecutive row-ids that have the same CS

rleids <- rleid(df$CS)        # generate run-length type cluster ids

df_unique <- df %>%                       # create temporary df with only first of each cluster:
  mutate(org_rowid = row_number()) %>%    # save the original row number
  filter(CS != lag(CS) | is.na(lag(CS)))  # filter out rows that have identical CS as previous row
  
# step 2: identify (clusters of) rows higher / lower than the values around them

peaks <- df_unique %>%
  filter( (CS > lead(CS) | is.na(lead(CS))) & (CS > lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

troughs <- df_unique %>%
  filter( (CS < lead(CS) | is.na(lead(CS))) & (CS < lag(CS) | is.na(lag(CS))) ) %>%
  pull(org_rowid)

# step 3: add row numbers in the same cluster

peaks <- lapply(peaks, function(x) which(rleids == rleids[x]))
troughs <- lapply(troughs, function(x) which(rleids == rleids[x]))

# step 4: cleanup

rm(df_unique, rleids)

Results:

> peaks
[[1]]
[1] 2 3

[[2]]
[1] 8

> troughs
[[1]]
[1] 1

[[2]]
[1] 6

[[3]]
[1] 10
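
To get from these lists to something like the expected output in the question, one option is to pair each trough with the first peak that starts after it. This is a sketch under that assumed pairing rule (the question's averaging of three leading/lagging records is not implemented); inputs are repeated so it runs standalone:

```r
# sketch: pair each trough with the first peak that starts after it
df <- data.frame(
  CS = c(70L, 138L, 138L, 120L, 100L, 80L, 110L, 150L, 100L, 70L),
  TIMESTAMP = c("25/06/2021 00:00", "25/06/2021 04:00", "25/06/2021 08:00",
                "25/06/2021 09:00", "25/06/2021 12:00", "25/06/2021 16:00",
                "25/06/2021 20:00", "26/06/2021 00:00", "27/06/2021 04:00",
                "27/06/2021 08:00"))
peaks   <- list(c(2L, 3L), 8L)   # row-id clusters from step 3
troughs <- list(1L, 6L, 10L)

trough_start <- sapply(troughs, `[`, 1)   # first row id of each trough cluster
peak_start   <- sapply(peaks, `[`, 1)     # first row id of each peak cluster

result <- data.frame(Min_Time = df$TIMESTAMP[trough_start],
                     CS_Min   = df$CS[trough_start],
                     Max_Time = NA_character_,
                     CS_Max   = NA_integer_)
for (i in seq_along(trough_start)) {
  nxt <- peak_start[peak_start > trough_start[i]]   # peaks after this trough
  if (length(nxt) > 0) {
    result$Max_Time[i] <- df$TIMESTAMP[nxt[1]]
    result$CS_Max[i]   <- df$CS[nxt[1]]
  }
}
```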
