
Subsetting sequences in R based on certain criteria

I would like to know if there is a way of subsetting a huge R dataframe [df] so that only certain sequences remain for each group [device].

I have a dataframe [df] like this:

id   device   date                pressure    
1    B3       2020-04-15 08:00    112         
2    B3       2020-04-15 09:00    100         
3    B3       2020-04-15 10:00    89          
4    B3       2020-04-15 11:00    90          
5    B3       2020-04-15 12:00    60          
6    B3       2020-04-15 13:00    28          
7    B3       2020-04-16 09:00    120         
8    B3       2020-04-16 10:00    80          
9    B3       2020-04-16 11:00    73          
10   B3       2020-04-16 12:00    61          
11   B3       2020-04-16 13:00    30   

I would like to get only the rows where the pressure drops from 120 down to 60 [or first value lower than 60].

The expected result would be as follows:

id   device   date                pressure    group
1    B3       2020-04-15 08:00    112         1
2    B3       2020-04-15 09:00    100         1
3    B3       2020-04-15 10:00    89          1
4    B3       2020-04-15 11:00    90          1
5    B3       2020-04-15 12:00    60          1
7    B3       2020-04-16 09:00    120         2
8    B3       2020-04-16 10:00    80          2
9    B3       2020-04-16 11:00    73          2
10   B3       2020-04-16 12:00    61          2
11   B3       2020-04-16 13:00    30          2

Would this be possible? Thank you for any suggestions.

You can start a new group whenever the current value is above 60 and the previous value was 60 or lower, and then keep, within each group, only the rows up to and including the first value less than or equal to 60.

library(dplyr)

df %>%
  # default = 0 makes the very first row open group 1
  group_by(device,
           group = cumsum(pressure > 60 & lag(pressure, default = 0) <= 60)) %>%
  # which.max() on a logical vector returns the position of the first TRUE
  slice(seq_len(which.max(pressure <= 60)))

#      id device date             pressure group
#   <int> <chr>  <chr>               <int> <int>
# 1     1 B3     2020-04-15 08:00      112     1
# 2     2 B3     2020-04-15 09:00      100     1
# 3     3 B3     2020-04-15 10:00       89     1
# 4     4 B3     2020-04-15 11:00       90     1
# 5     5 B3     2020-04-15 12:00       60     1
# 6     7 B3     2020-04-16 09:00      120     2
# 7     8 B3     2020-04-16 10:00       80     2
# 8     9 B3     2020-04-16 11:00       73     2
# 9    10 B3     2020-04-16 12:00       61     2
#10    11 B3     2020-04-16 13:00       30     2
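
With several devices in the same data frame, it may be safer to compute the lag inside each device, so that the first row of one device is never compared with the last row of another. The following is only a sketch of that variation, not the answer's exact code: it additionally assumes that runs which never reach 60 should be dropped, and that rows before the first value above 60 (group 0) are not wanted. The name `result` is introduced here for illustration.

```r
library(dplyr)

# Sample data rebuilt from the question
df <- data.frame(
  id = 1:11,
  device = "B3",
  date = c("2020-04-15 08:00", "2020-04-15 09:00", "2020-04-15 10:00",
           "2020-04-15 11:00", "2020-04-15 12:00", "2020-04-15 13:00",
           "2020-04-16 09:00", "2020-04-16 10:00", "2020-04-16 11:00",
           "2020-04-16 12:00", "2020-04-16 13:00"),
  pressure = c(112, 100, 89, 90, 60, 28, 120, 80, 73, 61, 30),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(device) %>%
  # lag() is evaluated per device here, not over the whole data frame
  mutate(group = cumsum(pressure > 60 & lag(pressure, default = 0) <= 60)) %>%
  group_by(device, group) %>%
  filter(
    group > 0,                                     # assumption: skip rows before the first run
    any(pressure <= 60),                           # assumption: drop runs that never reach 60
    row_number() <= which.max(pressure <= 60)      # keep rows up to the first value <= 60
  ) %>%
  ungroup()
```

On the sample data this keeps the same rows as the expected result in the question (ids 1 to 5 and 7 to 11).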

If you want to do it without dplyr and pipes, you can loop through the pressures to annotate the groups:

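d <- df                    # 'd' is the data frame from the question
d$group <- NA
d$group[1] <- 1
for (i in seq_len(nrow(d))[-1]) {    # rows 2..n; empty if there is only one row
  if (d$pressure[i] > 60 && d$pressure[i - 1] < 60) {
    # pressure is back above 60 after a value below 60: start a new group
    d$group[i] <- d$group[i - 1] + 1
  } else if (d$pressure[i] > d$pressure[i - 1] && d$pressure[i] < 60) {
    # pressure rises again while still below 60: also start a new group
    d$group[i] <- d$group[i - 1] + 1
  } else {
    # otherwise stay in the current group
    d$group[i] <- d$group[i - 1]
  }
}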

In such an if / else if block, you can add as many different conditions as you want (e.g. a change of device, a change of date, ...).
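
As an illustration of adding a condition, the sketch below starts a new group on a device change as well, and then subsets each group up to its first value of 60 or lower using base R only. It is a variation under stated assumptions rather than the loop above verbatim: the helper names `new_device`, `new_drop`, `keep` and `result` are made up for this example, the threshold test uses `<= 60` for the previous value, and a group that never reaches 60 is kept whole.

```r
# Sample data rebuilt from the question
d <- data.frame(
  id = 1:11,
  device = "B3",
  date = c("2020-04-15 08:00", "2020-04-15 09:00", "2020-04-15 10:00",
           "2020-04-15 11:00", "2020-04-15 12:00", "2020-04-15 13:00",
           "2020-04-16 09:00", "2020-04-16 10:00", "2020-04-16 11:00",
           "2020-04-16 12:00", "2020-04-16 13:00"),
  pressure = c(112, 100, 89, 90, 60, 28, 120, 80, 73, 61, 30),
  stringsAsFactors = FALSE
)

# Annotate groups: a new group starts on a device change or when the
# pressure climbs back above 60 after a value of 60 or lower.
d$group <- 1L
for (i in seq_len(nrow(d))[-1]) {
  new_device <- d$device[i] != d$device[i - 1]
  new_drop   <- d$pressure[i] > 60 && d$pressure[i - 1] <= 60
  d$group[i] <- d$group[i - 1] + as.integer(new_device || new_drop)
}

# Within each group, keep rows up to and including the first pressure <= 60.
# cumsum(hit) - hit counts the values <= 60 seen strictly before each row.
keep <- ave(d$pressure, d$group, FUN = function(p) {
  hit <- p <= 60
  cumsum(hit) - hit == 0
}) == 1

result <- d[keep, ]
```

For the sample data, `result` contains ids 1 to 5 in group 1 and ids 7 to 11 in group 2; row 6 (pressure 28) is dropped because it comes after the first value of 60 or lower in its group.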
