
Check whether a vector element of one value is placed between vector elements of two other values in R

I could not find a way to check whether vector elements with one categorical value lie between elements with two other categorical values. Given this data frame:

id    letter
1     B
2     A
3     B
4     B
5     C
6     B
7     A
8     B
9     C

Everything I found deals with numerical values and a general ordering, rather than with the position (index) of an element in a specific vector.

I want to add a new column to the data frame that flags each B: 1 if the B lies between an A and a C, 0 if it lies between a C and an A (non-B rows get NA):

id    letter    between
1     B         0
2     A         NA
3     B         1
4     B         1
5     C         NA
6     B         0
7     A         NA
8     B         1
9     C         NA

A combination of rle (run length encoding) and zoo::rollapply is one option:

library(zoo) 
d <- structure(list(id     = 1:9, 
                    letter = structure(c(2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 3L), 
                                       .Label = c("A", "B", "C"), 
                                       class = "factor")), 
                    class  = "data.frame", row.names = c(NA, -9L)) 
rl <- rle(as.numeric(d$letter)) 
rep(rollapply(c(NA, rl$values, NA), 
             3,
             function(x) if (x[2] == 2) 
                             ifelse(x[1] == 1 && x[3] == 3, 1, 0) 
                         else NA),
    rl$lengths)
# [1]  0 NA  1  1 NA  0 NA  1 NA

Explanation

  1. With rle you identify blocks of consecutive values.
  2. With rollapply you "roll" a function with a given window size (here 3) over a vector.
  3. Our vector rl$values contains the value of each block, and the function we apply to it is straightforward:
    • if the second element is anything but 2 (the factor code for B), return NA
    • if the second element is 2, return 1 when element 1 is 1 (A) and element 3 is 3 (C), and 0 otherwise
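The same windowing can also be sketched in base R without zoo. This is a minimal sketch, not the answer's code: it assumes letter is a character vector, and prev_run, next_run and run_flag are illustrative names.

```r
letter <- c("B", "A", "B", "B", "C", "B", "A", "B", "C")

# Step 1: collapse consecutive repeats into runs
rl <- rle(letter)
n  <- length(rl$values)

# Step 2: pad with NA so every run has a "previous" and a "next" run
padded   <- c(NA, rl$values, NA)
prev_run <- padded[1:n]
next_run <- padded[3:(n + 2)]

# Step 3: flag B runs; %in% treats the NA padding as "no match"
run_flag <- ifelse(rl$values == "B",
                   as.numeric(prev_run %in% "A" & next_run %in% "C"),
                   NA)

# Expand the per-run flags back to one value per original element
between <- rep(run_flag, rl$lengths)
between
```

Because %in% never returns NA, a B run at either end of the vector gets 0 here, whereas the rollapply version above can return NA for a trailing B that follows an A.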

It's unclear from the question whether "A" and "C" must alternate, though that's implied because there is no coding for "B" between "A" and "A" or vice versa. Supposing that they do, for the vector

x = c("B", "A", "B", "B", "C", "B", "A", "B", "C")

map to numeric values c(A=1, B=0, C=-1) and form the cumulative sum

v = cumsum(c(A=1, B=0, C=-1)[x])

(increment by 1 when encountering "A", decrement by 1 when encountering "C"). Replace positions not corresponding to "B" with NA:

v[x != "B"] = NA

giving

> v
 B  A  B  B  C  B  A  B  C
 0 NA  1  1 NA  0 NA  1 NA

This could be captured as a function

fun = function(x, map = c(A = 1, B = 0, C = -1)) {
    x = map[x]
    v = cumsum(x)
    v[x != 0] = NA
    v
}

and used to transform a data.frame or tibble, e.g.,

library(dplyr)   # provides %>%, mutate() and re-exports tibble()
tibble(x) %>% mutate(v = fun(x))
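Since the cumulative sum only yields 0/1 flags when "A" and "C" alternate, a guarded variant may be useful. The following is a sketch under that assumption. The name between_ac and the error messages are hypothetical, and as.character() is added so that a factor is looked up by label rather than by integer code.

```r
between_ac <- function(x, map = c(A = 1, B = 0, C = -1)) {
  x    <- as.character(x)     # index the map by name, even for factors
  step <- unname(map[x])
  if (anyNA(step)) stop("x contains values that are not in 'map'")

  v <- cumsum(step)
  # With strict A/C alternation starting at "A", the running sum stays in {0, 1}
  if (any(v < 0 | v > 1)) {
    stop("'A' and 'C' do not alternate, or a 'C' precedes the first 'A'")
  }

  v[step != 0] <- NA          # keep flags only at the "B" positions
  v
}

between_ac(c("B", "A", "B", "B", "C", "B", "A", "B", "C"))
```

An input such as c("A", "B", "A") stops with an error instead of silently returning a value of 2.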

A different tidyverse possibility could be:

library(dplyr)
library(tidyr)   # for fill()
df %>%
  group_by(grp = with(rle(letter), rep(seq_along(lengths), lengths))) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  mutate(res = ifelse(lag(letter, default = first(letter)) == "A" & 
                      lead(letter, default = last(letter)) == "C", 1, 0)) %>%
  select(-letter, -grp) %>%
  full_join(df, by = c("id" = "id")) %>%
  arrange(id) %>%
  fill(res) %>%
  mutate(res = ifelse(letter != "B", NA, res))

    id   res letter
  <int> <dbl> <chr> 
1     1     0 B     
2     2    NA A     
3     3     1 B     
4     4     1 B     
5     5    NA C     
6     6     0 B     
7     7    NA A     
8     8     1 B     
9     9    NA C 

First, this groups the rows by a run-length ID and keeps only the first row of each run. Second, it checks the condition on those rows. Third, it performs a full join with the original df on the "id" column. Finally, it arranges the rows by "id", fills the missing values downwards, and assigns NA to rows where "letter" is not "B".
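The pipeline does not show how df is built. The sketch below assumes it is the question's data with letter stored as a character column (rle() rejects factors), and shows in base R what the run-length ID and the "first row per run" step produce. The name first_rows is illustrative.

```r
# Assumed input: the question's data, with letter as character
df <- data.frame(id = 1:9,
                 letter = c("B", "A", "B", "B", "C", "B", "A", "B", "C"),
                 stringsAsFactors = FALSE)

# Run-length ID: one integer per block of consecutive equal letters
grp <- with(rle(df$letter), rep(seq_along(lengths), lengths))
grp
# 1 2 3 3 4 5 6 7 8

# Base-R equivalent of group_by(grp) %>% filter(row_number() == 1)
first_rows <- df[!duplicated(grp), ]
first_rows$id
# 1 2 3 5 6 7 8 9
```

Row 4 is dropped at this stage, which is why the pipeline needs the full join and fill() to bring it back with the flag of its run.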

Here's one solution, which I hope is fairly easy conceptually. For 'special' cases such as B being at the top or bottom of the list, or having an A or a C on both sides, I've set such values to 0.

# Create dummy data (substitute your own)
df <- data.frame(id=1:100, letter=sample(c("A", "B", "C"), 100, replace=TRUE))

# Copy down info on whether A or C is above each B
acup <- df$letter
for(i in 2:nrow(df))
  if(df$letter[i] == "B")
    acup[i] <- acup[i-1]

# Copy up info on whether A or C is below each B
acdown <- df$letter
for(i in (nrow(df) - 1):1)   # same as nrow(df):2 - 1, since `:` binds tighter than `-`
  if(df$letter[i] == "B")
    acdown[i] <- acdown[i+1]

# Set appropriate values for column 'between'
df$between <- NA
df$between[acup == "A" & acdown == "C"] <- 1
df$between[df$letter == "B" & is.na(df$between)] <- 0   # Includes special cases
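As a check, the same copy-down / copy-up idea can be run on the nine letters from the question. This is a sketch on a plain character vector rather than the random df above. The loops use seq_len() so that a one-element input does not iterate, which is a small change from the answer's 2:nrow(df).

```r
letter <- c("B", "A", "B", "B", "C", "B", "A", "B", "C")
n <- length(letter)

# Copy down: for each B, the nearest non-B letter above it
acup <- letter
for (i in seq_len(n)[-1])
  if (letter[i] == "B") acup[i] <- acup[i - 1]

# Copy up: for each B, the nearest non-B letter below it
acdown <- letter
for (i in rev(seq_len(n - 1)))
  if (letter[i] == "B") acdown[i] <- acdown[i + 1]

between <- rep(NA_real_, n)
between[acup == "A" & acdown == "C"] <- 1
between[letter == "B" & is.na(between)] <- 0   # includes the leading B
between
# 0 NA 1 1 NA 0 NA 1 NA
```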

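You can use the lead and lag functions to look up the letters before and after each row and then mutate as below. Because this looks only one row back and one row ahead, it relies on A and C always being separated by a run of one or two B's, as in the example data: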

library(dplyr)
df %>%
  mutate(letter_lag = lag(letter, 1),
         letter_lead = lead(letter, 1)) %>%
  mutate(between = case_when(letter_lag == "A" | letter_lead == "C" ~ 1,
                             letter_lag == "C" | letter_lead == "A" ~ 0,
                             TRUE ~ NA_real_)) %>%
  select(id, letter, between)

  id letter between
1  1      B       0
2  2      A      NA
3  3      B       1
4  4      B       1
5  5      C      NA
6  6      B       0
7  7      A      NA
8  8      B       1
9  9      C      NA
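To see how far a one-row lookaround reaches, here is a base-R re-creation of the lag/lead conditions applied to a run of three B's. It is a sketch, not the dplyr code itself: letter_lag and letter_lead are built by hand, and which() is used so that NA conditions count as "no match", as they do in case_when.

```r
letter <- c("A", "B", "B", "B", "C")
n <- length(letter)

# Hand-built equivalents of dplyr::lag(letter, 1) and dplyr::lead(letter, 1)
letter_lag  <- c(NA, letter[-n])
letter_lead <- c(letter[-1], NA)

cond1 <- letter_lag == "A" | letter_lead == "C"
cond0 <- letter_lag == "C" | letter_lead == "A"

between <- rep(NA_real_, n)
between[which(cond0)] <- 0
between[which(cond1)] <- 1   # assigned last, so the first condition wins
between
# NA 1 NA 1 NA
```

The middle B has a B on both sides, so neither condition matches and it stays NA even though it sits between an A and a C. For longer runs, one of the run-based approaches above is the safer choice.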
