Split vector by each NA in R

I have the following vector called input:

input <- c(1,2,1,NA,3,2,NA,1,5,6,NA,2,2)

[1]  1  2  1 NA  3  2 NA  1  5  6 NA  2  2

I would like to split this vector into multiple vectors by each NA. So the desired output should look like this:

> output
[[1]]
[1] 1 2 1

[[2]]
[1] 3 2

[[3]]
[1] 1 5 6

[[4]]
[1] 2 2

As you can see, every time an NA appears, a new vector starts. Does anyone know how to split a vector by each NA into multiple vectors?

Using a similar logic to @tpetzoldt, but removing the NAs before the split:

split(na.omit(input), cumsum(is.na(input))[!is.na(input)])

$`0`
[1] 1 2 1

$`1`
[1] 3 2

$`2`
[1] 1 5 6

$`3`
[1] 2 2
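The list above is named `0`, `1`, ... after the cumulative NA counts, while the desired output in the question is unnamed. A minimal sketch, assuming the names are not needed, wraps the same call in unname():

```r
input <- c(1, 2, 1, NA, 3, 2, NA, 1, 5, 6, NA, 2, 2)

# Same split as above; unname() drops the "0", "1", ... group names
# so the list prints as [[1]], [[2]], ...
output <- unname(split(na.omit(input), cumsum(is.na(input))[!is.na(input)]))
print(output)
```
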

One way could go as follows:

  1. identify the NAs
  2. do cumsum
  3. split according to the cumulative sums
  4. remove the NAs

input <- c(1,2,1,NA,3,2,NA,1,5,6,NA,2,2)
tmp <- cumsum(is.na(input))
lapply(split(input, tmp), na.omit)
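To make the steps concrete, here is a sketch that prints the intermediate values for the example vector. The names groups and output are introduced only for illustration, and step 4 uses plain logical indexing instead of na.omit(), which would also attach an "na.action" attribute to the groups it changes:

```r
input <- c(1, 2, 1, NA, 3, 2, NA, 1, 5, 6, NA, 2, 2)

# Steps 1 and 2: is.na() flags each NA and cumsum() counts the NAs seen so far,
# so every element from the k-th NA onwards carries the group id k.
tmp <- cumsum(is.na(input))
print(tmp)
# 0 0 0 1 1 1 2 2 2 2 3 3 3

# Step 3: split by group id; each NA sits at the start of the group it opens.
groups <- split(input, tmp)
print(groups[["1"]])
# NA 3 2

# Step 4: drop the NAs from every group.
output <- lapply(groups, function(g) g[!is.na(g)])
print(output[["1"]])
# 3 2
```
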

This one is verbose and overcomplicated, but I find it easier to think about such problems this way, so I wanted to share it:

library(tidyverse)

tibble(input) %>% 
  group_by(id = cumsum(is.na(input))) %>% 
  na.omit %>% 
  group_split() %>% 
  map(.,~(.x %>%select(-id))) %>% 
  map(.,~(.x %>%pull))

[[1]]
[1] 1 2 1

[[2]]
[1] 3 2

[[3]]
[1] 1 5 6

[[4]]
[1] 2 2

Here's a solution that is not verbose:

strsplit(paste(input, collapse = " "), " NA ")
[[1]]
[1] "1 2 1" "3 2"   "1 5 6" "2 2" 
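Note that this returns a single character vector rather than a list of numeric vectors. A possible follow-up, sketched under the assumption that the NAs are only interior and never adjacent (a leading, trailing or doubled NA is not matched by " NA "), converts the pieces back to numbers. The names parts and output are illustrative:

```r
input <- c(1, 2, 1, NA, 3, 2, NA, 1, 5, 6, NA, 2, 2)

# Split the collapsed string on " NA ", giving one string per group.
parts <- strsplit(paste(input, collapse = " "), " NA ")[[1]]

# Split each group on spaces and convert the pieces back to numbers.
output <- lapply(strsplit(parts, " "), as.numeric)
print(output[[3]])
# 1 5 6
```

Because the values make a round trip through text, this is best suited to integers or short decimals.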

Another way, quite similar to those of @tpetzoldt and @tmfmnk, also removing the NAs:

. <- is.na(input)
split(input[!.], cumsum(.)[!.])
#$`0`
#[1] 1 2 1
#
#$`1`
#[1] 3 2
#
#$`2`
#[1] 1 5 6
#
#$`3`
#[1] 2 2

Or the other way round

i <- !is.na(input)
split(input[i], cumsum(!i)[i])

or even

i <- is.na(input)
j <- which(!i)
split(input[j], cumsum(i)[j])
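
The three variants differ only in how they index the vector. A quick sketch to confirm that they return the same list on the example (is_na, v1, v2 and v3 are illustrative names, and the third variant takes its cumulative sum over the NA flags):

```r
input <- c(1, 2, 1, NA, 3, 2, NA, 1, 5, 6, NA, 2, 2)

# Variant 1: logical mask of the NAs, negated for subsetting.
is_na <- is.na(input)
v1 <- split(input[!is_na], cumsum(is_na)[!is_na])

# Variant 2: logical mask of the non-NAs, negated for the cumulative sum.
i <- !is.na(input)
v2 <- split(input[i], cumsum(!i)[i])

# Variant 3: integer positions of the non-NAs.
j <- which(!is_na)
v3 <- split(input[j], cumsum(is_na)[j])

print(identical(v1, v2) && identical(v2, v3))
# TRUE
```
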

Benchmark

set.seed(42)
n <- 1e5
input <- sample(c(1:9, NA), n, TRUE)

library(tidyverse) #for TarJae

bench::mark(check = FALSE,
tmfmnk = split(na.omit(input), cumsum(is.na(input))[!is.na(input)]),
tpetzoldt = {tmp <- cumsum(is.na(input))
    lapply(split(input, tmp), na.omit)},
TarJae = {tibble(input) %>% 
  group_by(id = cumsum(is.na(input))) %>% 
  na.omit %>% 
  group_split() %>% 
  map(.,~(.x %>%select(-id))) %>% 
      map(.,~(.x %>%pull))},
ChrisR = strsplit(paste(input, collapse = " "), " NA "), #Returns String
Thomas = split(na.omit(input), findInterval(seq_along(input)[!is.na(input)], which(is.na(input)))),
GKi1 = {. <- is.na(input); split(input[!.], cumsum(.)[!.])},
GKi2 = {i <- !is.na(input); split(input[i], cumsum(!i)[i])},
GKi3 = {i <- is.na(input); j <- which(!i); split(input[j], cumsum(i)[j])}
)
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 tmfmnk      7.28ms  8.25ms   45.7       7.93MB     5.96    23     3    503.5ms
#2 tpetzoldt  46.65ms 49.07ms   19.8        4.4MB     5.95    10     3    504.4ms
#3 TarJae      14.17s  14.17s    0.0706   98.25MB     3.74     1    53      14.2s
#4 ChrisR     17.92ms 18.47ms   54.2        1.8MB     0       28     0    516.6ms
#5 Thomas      7.78ms  7.92ms  113.        8.71MB    25.8     57    13    503.7ms
#6 GKi1        6.71ms  6.84ms   81.6       6.63MB     7.96    41     4    502.3ms
#7 GKi2        6.71ms  6.81ms  136.        6.63MB    11.9     69     6      506ms
#8 GKi3         6.6ms  6.71ms  143.        5.52MB    11.9     72     6    502.8ms

Judging by iterations per second, GKi3 is in this case about 1.2 times faster than Thomas, 2.5 times faster than ChrisR, 3 times faster than tmfmnk, 7 times faster than tpetzoldt and 2000 times faster than TarJae.

We can use split + findInterval as well

> split(na.omit(input), findInterval(seq_along(input)[!is.na(input)], which(is.na(input))))
$`0`
[1] 1 2 1

$`1`
[1] 3 2

$`2`
[1] 1 5 6

$`3`
[1] 2 2
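
To see why this works, the sketch below prints the grouping vector that findInterval() produces for the example. For each non-NA position it counts how many NA positions lie at or before it. The names na_pos, val_pos and grp are introduced only for illustration:

```r
input <- c(1, 2, 1, NA, 3, 2, NA, 1, 5, 6, NA, 2, 2)

na_pos  <- which(is.na(input))               # positions of the NAs: 4 7 11
val_pos <- seq_along(input)[!is.na(input)]   # positions of the values: 1 2 3 5 6 8 9 10 12 13

# Number of NA positions <= each value position, i.e. the group id.
grp <- findInterval(val_pos, na_pos)
print(grp)
# 0 0 0 1 1 2 2 2 3 3
```
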

One way to split a vector by each NA value into multiple vectors is to use the split function in R.

Here is an example of how you could do this:

Flag the positions of the NA values in the input vector

is_na <- is.na(input)

Split the non-NA values into a list of vectors, grouped by the number of NA values that precede each one

output <- split(input[!is_na], cumsum(is_na)[!is_na])

This will create a list called output that contains multiple vectors, with each vector representing a group of consecutive values in the input vector that are separated by one or more NA values.

You can then access each vector in the list using indexing, for example:

output[[1]] # access the first vector in the list
output[[2]] # access the second vector in the list
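
As a sketch of the "one or more NA values" case, the cumulative-sum grouping can be wrapped in a small helper and tried on a vector with leading, trailing and consecutive NAs. The function name split_by_na is hypothetical and not part of base R:

```r
# Hypothetical helper: split x into runs of non-NA values.
split_by_na <- function(x) {
  is_na <- is.na(x)
  # Group each non-NA value by the number of NAs before it;
  # group ids with no values simply never appear, so no empty vectors result.
  unname(split(x[!is_na], cumsum(is_na)[!is_na]))
}

res <- split_by_na(c(NA, 1, 2, NA, NA, 3, NA))
print(length(res))
# 2
print(res[[1]])
# 1 2
```
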

I hope this helps. Let me know if you have any questions.
