Min and max of NA sequences

Question

I have a data frame in which the column foo contains consecutive runs of NA values. For example:

> test
   id  foo                time
1   1 <NA> 2018-11-19 00:00:48
2   1 <NA> 2018-11-19 00:10:51
3   1 <NA> 2018-11-19 00:21:15
4   1 <NA> 2018-11-19 00:31:02
5   1    x 2018-11-19 00:40:59
6   1    x 2018-11-19 00:50:49
7   1    x 2018-11-19 01:01:15
8   1 <NA> 2018-11-19 01:11:07
9   1 <NA> 2018-11-19 01:20:49
10  2 <NA> 2018-11-19 01:30:50
11  2 <NA> 2018-11-19 01:40:43
12  2    x 2018-11-19 01:50:46
13  2    x 2018-11-19 02:01:02
14  2    x 2018-11-19 02:10:44
15  2 <NA> 2018-11-19 02:20:51
16  2 <NA> 2018-11-19 02:31:06
17  2 <NA> 2018-11-19 02:40:42
18  2 <NA> 2018-11-19 02:50:45
19  3 <NA> 2018-11-19 03:01:00
20  3 <NA> 2018-11-19 03:10:42
21  3 <NA> 2018-11-19 03:21:10
22  3 <NA> 2018-11-19 03:31:10
23  3    x 2018-11-19 03:40:44
24  3 <NA> 2018-11-19 03:50:46
25  3    x 2018-11-19 04:00:46

My objective is to mark where each NA run begins and ends, by id and time. For example, the above dataset would get an extra column called index which marks where these runs of NA values start and end. However, the last NA in each id series should be ignored, and a single NA value should be marked as "both". For example:

> test
   id  foo                time     index
1   1 <NA> 2018-11-19 00:00:48 na_starts
2   1 <NA> 2018-11-19 00:10:51          
3   1 <NA> 2018-11-19 00:21:15          
4   1 <NA> 2018-11-19 00:31:02   na_ends
5   1    x 2018-11-19 00:40:59          
6   1    x 2018-11-19 00:50:49          
7   1    x 2018-11-19 01:01:15          
8   1 <NA> 2018-11-19 01:11:07 na_starts
9   1 <NA> 2018-11-19 01:20:49          
10  2 <NA> 2018-11-19 01:30:50 na_starts
11  2 <NA> 2018-11-19 01:40:43   na_ends
12  2    x 2018-11-19 01:50:46          
13  2    x 2018-11-19 02:01:02          
14  2    x 2018-11-19 02:10:44          
15  2 <NA> 2018-11-19 02:20:51 na_starts
16  2 <NA> 2018-11-19 02:31:06          
17  2 <NA> 2018-11-19 02:40:42          
18  2 <NA> 2018-11-19 02:50:45          
19  3 <NA> 2018-11-19 03:01:00 na_starts
20  3 <NA> 2018-11-19 03:10:42          
21  3 <NA> 2018-11-19 03:21:10          
22  3 <NA> 2018-11-19 03:31:10   na_ends
23  3    x 2018-11-19 03:40:44          
24  3 <NA> 2018-11-19 03:50:46      both
25  3    x 2018-11-19 04:00:46

How would one achieve this with rle or a similar function in R?

 dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA, 
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA, 
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648, 
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275, 
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862, 
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460, 
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-25L), class = "data.frame")

Answer 1

Maybe this will work? I'm not entirely sure what relationship time has to the problem, other than that I think you wanted the data sorted by id and time.

library("tidyverse")
test = test %>% 
  arrange(id, time) %>% 
  mutate(miss = is.na(foo))

# This will make the index column for a single run
mark_ends = function(n, miss){
  if(!miss){
    rep("", times = n)
  } else if(n == 1){
    "both"
  } else {
    c("na_starts", rep("", times = n - 2), "na_ends")
  }
}

# This will use mark_ends across a single ID
mark_index = function(id){
  runs = rle(test$miss[test$id == id])
  result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>% 
    reduce(.f = c)
  # the last row of each id is never marked
  result[length(result)] = ""
  result
}

# use the function on each id, combine, and put it in test
test$index = unique(test$id) %>% 
  map(mark_index) %>% 
  reduce(.f = c)
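
To make the run-length step easier to follow, here is a minimal base R sketch of what rle returns for the missingness pattern of id 1 (rows 1 to 9 of the example) and how the labels are built from it. The helper name label_run is hypothetical and only mirrors the rules of mark_ends; it is an illustration, not a replacement for the code above.

```r
# Missingness of foo for id 1 in the example data (rows 1-9)
miss <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# rle() collapses the vector into runs: lengths 4, 3, 2 with
# values TRUE, FALSE, TRUE
runs <- rle(miss)

# Hypothetical helper: one label vector per run, same rules as mark_ends
label_run <- function(n, miss) {
  if (!miss) return(rep("", n))
  if (n == 1) return("both")
  c("na_starts", rep("", n - 2), "na_ends")
}

labels <- unlist(Map(label_run, runs$lengths, runs$values))

# The last row of an id is never marked, so a trailing "na_ends" is dropped
labels[length(labels)] <- ""
labels
```

The result has one entry per row of the id, so it can be assigned straight into the index column for those rows.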

Answer 2

Using tidyverse and data.table you can do:

library(tidyverse)
library(data.table)  # for rleid()

test %>%
 rowid_to_column() %>%
 group_by(id, temp = rleid(foo)) %>%
 mutate(temp2 = seq_along(temp),
        index = ifelse(is.na(foo) & temp2 == min(temp2) & temp2 == max(temp2), "both", 
                       ifelse(is.na(foo) & temp2 == min(temp2), "na_starts", 
                              ifelse(is.na(foo) & temp2 == max(temp2), "na_ends", NA)))) %>%
 group_by(id) %>%
 mutate(index = ifelse(rowid == max(rowid[is.na(foo) & max(temp) & max(temp2)]) & 
                         is.na(lag(foo)), NA, index)) %>%
 select(-temp, -temp2, -rowid)

      id foo   time                index    
   <dbl> <chr> <dttm>              <chr>    
 1    1. <NA>  2018-11-19 00:00:48 na_starts
 2    1. <NA>  2018-11-19 00:10:51 <NA>     
 3    1. <NA>  2018-11-19 00:21:15 <NA>     
 4    1. <NA>  2018-11-19 00:31:02 na_ends  
 5    1. x     2018-11-19 00:40:59 <NA>     
 6    1. x     2018-11-19 00:50:49 <NA>     
 7    1. x     2018-11-19 01:01:15 <NA>     
 8    1. <NA>  2018-11-19 01:11:07 na_starts
 9    1. <NA>  2018-11-19 01:20:49 <NA>     
10    2. <NA>  2018-11-19 01:30:50 na_starts
11    2. <NA>  2018-11-19 01:40:43 na_ends  
12    2. x     2018-11-19 01:50:46 <NA>     
13    2. x     2018-11-19 02:01:02 <NA>     
14    2. x     2018-11-19 02:10:44 <NA>     
15    2. <NA>  2018-11-19 02:20:51 na_starts
16    2. <NA>  2018-11-19 02:31:06 <NA>     
17    2. <NA>  2018-11-19 02:40:42 <NA>     
18    2. <NA>  2018-11-19 02:50:45 <NA>     
19    3. <NA>  2018-11-19 03:01:00 na_starts
20    3. <NA>  2018-11-19 03:10:42 <NA>     
21    3. <NA>  2018-11-19 03:21:10 <NA>     
22    3. <NA>  2018-11-19 03:31:10 na_ends  
23    3. x     2018-11-19 03:40:44 <NA>     
24    3. <NA>  2018-11-19 03:50:46 both     
25    3. x     2018-11-19 04:00:46 <NA>

First, it creates a unique row ID. Second, it groups by "id" and the run-length ID of "foo". Third, it numbers the rows within each run of "foo". Fourth, it creates the "index" variable using the given conditions. Then, it groups by "id" and assigns NA to the last row of a missing "foo" sequence at the end of each id. Finally, it removes the helper variables.
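
The helper columns are easier to picture on a toy vector. The sketch below is an assumption-laden illustration in base R: the run ID is built from rle(is.na(foo)) instead of data.table's rleid so that it runs without extra packages, the toy foo is not the question's data, and run_len is a hypothetical extra column standing in for the min/max comparison.

```r
# Toy column, not the question's data
foo <- c(NA, NA, "x", NA, "x")

# Run ID per row (stand-in for rleid): 1 1 2 3 4
r <- rle(is.na(foo))
temp <- rep(seq_along(r$lengths), r$lengths)

# Position of each row within its run: 1 2 1 1 1
temp2 <- ave(seq_along(foo), temp, FUN = seq_along)

# Length of the run each row belongs to: 2 2 1 1 1
run_len <- ave(seq_along(foo), temp, FUN = length)

# Same labelling conditions as in the pipeline, written against run_len
index <- ifelse(!is.na(foo), NA,
         ifelse(run_len == 1, "both",
         ifelse(temp2 == 1, "na_starts",
         ifelse(temp2 == run_len, "na_ends", NA))))
index
```

This only covers the labelling step; dropping the mark on the last row of each id is a separate step, as in the pipeline.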

Answer 3

A possible solution using data.table:

library(data.table)
setDT(test)

ind <- test[, .(ri = unique(.I[c(1,.N)][all(is.na(foo))]))
            , by = .(id, rl = rleid(is.na(foo)))
            ][, index := list("both",c("na_starts","na_ends"))[[1 + (.N > 1)]]
              , by = .(id, rl)][]

test[ind$ri, index := ind$index
     ][test[, .I[.N], by = id]$V1, index := NA][]

which gives:

> test
    id  foo                time     index
 1:  1 <NA> 2018-11-19 00:00:48 na_starts
 2:  1 <NA> 2018-11-19 00:10:51      <NA>
 3:  1 <NA> 2018-11-19 00:21:15      <NA>
 4:  1 <NA> 2018-11-19 00:31:02   na_ends
 5:  1    x 2018-11-19 00:40:59      <NA>
 6:  1    x 2018-11-19 00:50:49      <NA>
 7:  1    x 2018-11-19 01:01:15      <NA>
 8:  1 <NA> 2018-11-19 01:11:07 na_starts
 9:  1 <NA> 2018-11-19 01:20:49      <NA>
10:  2 <NA> 2018-11-19 01:30:50 na_starts
11:  2 <NA> 2018-11-19 01:40:43   na_ends
12:  2    x 2018-11-19 01:50:46      <NA>
13:  2    x 2018-11-19 02:01:02      <NA>
14:  2    x 2018-11-19 02:10:44      <NA>
15:  2 <NA> 2018-11-19 02:20:51 na_starts
16:  2 <NA> 2018-11-19 02:31:06      <NA>
17:  2 <NA> 2018-11-19 02:40:42      <NA>
18:  2 <NA> 2018-11-19 02:50:45      <NA>
19:  3 <NA> 2018-11-19 03:01:00 na_starts
20:  3 <NA> 2018-11-19 03:10:42      <NA>
21:  3 <NA> 2018-11-19 03:21:10      <NA>
22:  3 <NA> 2018-11-19 03:31:10   na_ends
23:  3    x 2018-11-19 03:40:44      <NA>
24:  3 <NA> 2018-11-19 03:50:46      both
25:  3    x 2018-11-19 04:00:46      <NA>
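
The idea of collecting the first and last row index of every NA run and then assigning labels at those positions can also be sketched in base R for a single id. This is an illustrative equivalent under assumptions, not the data.table code itself: miss is a toy vector, and na_start and na_end are hypothetical names.

```r
# Toy missingness pattern for one id, not the question's data
miss <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

r <- rle(miss)
ends <- cumsum(r$lengths)        # last row of every run
starts <- ends - r$lengths + 1   # first row of every run

# Keep only the runs that consist of NA values
na_start <- starts[r$values]
na_end <- ends[r$values]

index <- rep(NA_character_, length(miss))
index[na_start] <- "na_starts"
index[na_end] <- "na_ends"
# A run of length one starts and ends on the same row
index[na_start[na_start == na_end]] <- "both"
# The last row of the id is never marked
index[length(index)] <- NA
index
```

Working with row indices keeps the labelling to a handful of vectorised assignments, which is the same design choice the data.table version makes with .I.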
