
Select rows in dataframe conditional on a switching boolean variable in another column in R

Let's say I have the following dataframe in R:

set.seed(23)

# Create sample data
time = 1:15
x = rnorm(n = 15) 
y = rnorm(n = 15)
boolean = sample(c(TRUE,FALSE), 15, TRUE)
df <- data.frame(time, x, y, boolean)

# Output
> df
   time           x            y boolean
1     1  0.19321233  0.308136896    TRUE
2     2 -0.43468211 -0.520178315    TRUE
3     3  0.91326710 -0.442313801   FALSE # select
4     4  1.79338809 -0.599312812    TRUE # select
5     5  0.99660511  1.294577829    TRUE
6     6  1.10749049  0.835391247    TRUE
7     7 -0.27808628 -0.566015100    TRUE
8     8  1.01920549  0.788419350   FALSE # select
9     9  0.04543718 -1.165929326    TRUE # select
10   10  1.57577959 -0.530820006   FALSE # select
11   11  0.21828845 -0.001058737   FALSE
12   12 -1.04653534 -0.512562365   FALSE
13   13 -0.28868865  1.242867513   FALSE
14   14  0.48155029 -0.660582851   FALSE
15   15 -1.21637643  0.166624215    TRUE # select

Problem

I would like to select all the rows in which the boolean in the 4th column switches from FALSE to TRUE or vice versa (marked with # select in the dataframe above).

Question

How do I do this in R?

Attempt

I have found the select() and select_if() functions in the tidyverse; however, I am not able to select rows based on the previous value in the column.

We can use rle to create a counter that increments at every change in the boolean value. Using duplicated, we keep only the first row for each counter value. This also selects the first row of the data frame, but since that is not an actual change in boolean value, we remove it (using [-1, ]).

df[!duplicated(with(rle(df$boolean), rep(seq_along(values), lengths))), ][-1, ]

#   time           x            y boolean
#2     2 -0.43468211 -0.566015100    TRUE
#3     3  0.91326710  0.788419350   FALSE
#6     6  1.10749049 -0.001058737    TRUE
#8     8  1.01920549  1.242867513   FALSE
#9     9  0.04543718 -0.660582851    TRUE
#13   13 -0.28868865 -1.146665860   FALSE
#15   15 -1.21637643 -0.202111683    TRUE
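To see why this works, the one-liner can be unpacked into its intermediate steps. The following is a minimal base R sketch on the boolean column from the data section of this answer; the names b, r, run_id, first_of_run and switch_rows are only illustrative.

```r
# Boolean column from the data section of this answer
b <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
       TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

r <- rle(b)                                    # run lengths and run values
run_id <- rep(seq_along(r$values), r$lengths)  # one counter value per run
first_of_run <- !duplicated(run_id)            # TRUE at the first row of each run
switch_rows <- which(first_of_run)[-1]         # row 1 starts a run but is not a switch
switch_rows
# [1]  2  3  6  8  9 13 15
```

These are the same row numbers as in the output above.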

The same logic can be applied using data.table::rleid, which makes it a bit shorter:

df[!duplicated(data.table::rleid(df$boolean)), ][-1, ]

In dplyr, we can create groups using lag and cumsum and select the first row of every group.

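library(dplyr)
df %>%
  # group id increases by one at every change in boolean
  group_by(group = cumsum(boolean != lag(boolean, default = first(boolean)))) %>%
  slice(1L) %>%    # first row of every group
  ungroup() %>%
  slice(-1L) %>%   # drop row 1, which is not an actual change
  select(-group)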
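The grouping key built with lag and cumsum can also be inspected on its own. This is a minimal base R sketch, assuming the same boolean column as in the data section below, with c(b[1], head(b, -1)) standing in for dplyr's lag(); the names b, prev, group and first_rows are illustrative.

```r
# Boolean column from the data section of this answer
b <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
       TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Previous value of each element; the first element is compared with itself
prev <- c(b[1], head(b, -1))

# The counter increases by one at every switch, so each run gets its own id
group <- cumsum(b != prev)
group
# [1] 0 1 2 2 2 3 3 4 5 5 5 5 6 6 7

# First row of every group, minus row 1 (mirrors slice(1L) and slice(-1L))
first_rows <- which(!duplicated(group))[-1]
first_rows
# [1]  2  3  6  8  9 13 15
```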

data

df <- structure(list(time = 1:15, x = c(0.19321233, -0.43468211, 0.9132671, 
1.79338809, 0.99660511, 1.10749049, -0.27808628, 1.01920549, 
0.04543718, 1.57577959, 0.21828845, -1.04653534, -0.28868865, 
0.48155029, -1.21637643), y = c(0.835391247, -0.5660151, 0.78841935, 
-1.165929326, -0.530820006, -0.001058737, -0.512562365, 1.242867513, 
-0.660582851, 0.166624215, -0.55320524, 0.098181415, -1.14666586, 
-1.249927257, -0.202111683), boolean = c(FALSE, TRUE, FALSE, 
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, 
FALSE, TRUE)), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14","15"))

Here's another base solution:

df[c(FALSE, diff(df$boolean) != 0), ]

   time           x            y boolean
2     2 -0.43468211 -0.566015100    TRUE
3     3  0.91326710  0.788419350   FALSE
6     6  1.10749049 -0.001058737    TRUE
8     8  1.01920549  1.242867513   FALSE
9     9  0.04543718 -0.660582851    TRUE
13   13 -0.28868865 -1.146665860   FALSE
15   15 -1.21637643 -0.202111683    TRUE

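This relies on diff() coercing TRUE and FALSE to 1 and 0. Wherever the value changes, the difference between consecutive elements is either -1 or 1; otherwise it is 0. The leading FALSE keeps the index the same length as the data frame and excludes the first row.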
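A small sketch of that coercion on a made-up four-element vector (b, d and keep are illustrative names, not part of the answer):

```r
b <- c(TRUE, TRUE, FALSE, TRUE)

d <- diff(b)   # logicals are treated as 1/0 before subtracting
d
# [1]  0 -1  1

# diff() returns one element fewer than its input, so pad with FALSE for row 1
keep <- c(FALSE, d != 0)
keep
# [1] FALSE FALSE  TRUE  TRUE
```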

Using the helper function shift() from the data.table package (and the correct data provided by Ronak):

subset(df, boolean != data.table::shift(boolean, fill = boolean[1]))

   time           x            y boolean
2     2 -0.43468211 -0.566015100    TRUE
3     3  0.91326710  0.788419350   FALSE
6     6  1.10749049 -0.001058737    TRUE
8     8  1.01920549  1.242867513   FALSE
9     9  0.04543718 -0.660582851    TRUE
13   13 -0.28868865 -1.146665860   FALSE
15   15 -1.21637643 -0.202111683    TRUE
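The outputs above use the data from Ronak's answer. As a sanity check against the question's own data, the same previous-value comparison can be sketched in base R. Here the boolean column is transcribed from the data frame printed in the question, and switch_rows is a hypothetical helper written for this illustration, using c(v[1], head(v, -1)) in place of shift().

```r
# boolean column transcribed from the data frame printed in the question
b <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
       TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

# Hypothetical helper: row numbers where the value differs from the previous one
switch_rows <- function(v) {
  which(v != c(v[1], head(v, -1)))
}

switch_rows(b)
# [1]  3  4  8  9 10 15
```

These are the rows marked # select in the question.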
