Min and max of NA sequences

I have a data frame in which the column foo contains running sequences of NA values. For example:

> test
   id  foo                time
1   1 <NA> 2018-11-19 00:00:48
2   1 <NA> 2018-11-19 00:10:51
3   1 <NA> 2018-11-19 00:21:15
4   1 <NA> 2018-11-19 00:31:02
5   1    x 2018-11-19 00:40:59
6   1    x 2018-11-19 00:50:49
7   1    x 2018-11-19 01:01:15
8   1 <NA> 2018-11-19 01:11:07
9   1 <NA> 2018-11-19 01:20:49
10  2 <NA> 2018-11-19 01:30:50
11  2 <NA> 2018-11-19 01:40:43
12  2    x 2018-11-19 01:50:46
13  2    x 2018-11-19 02:01:02
14  2    x 2018-11-19 02:10:44
15  2 <NA> 2018-11-19 02:20:51
16  2 <NA> 2018-11-19 02:31:06
17  2 <NA> 2018-11-19 02:40:42
18  2 <NA> 2018-11-19 02:50:45
19  3 <NA> 2018-11-19 03:01:00
20  3 <NA> 2018-11-19 03:10:42
21  3 <NA> 2018-11-19 03:21:10
22  3 <NA> 2018-11-19 03:31:10
23  3    x 2018-11-19 03:40:44
24  3 <NA> 2018-11-19 03:50:46
25  3    x 2018-11-19 04:00:46

My objective is to mark where each sequence begins and ends, by id and time. For example, the above dataset would have an extra column called index which marks where each run of NA values starts and ends. However, a run of NA values that reaches the last row of an id should not get an end marker, and a single NA value would be marked as "both". For example:

> test
   id  foo                time     index
1   1 <NA> 2018-11-19 00:00:48 na_starts
2   1 <NA> 2018-11-19 00:10:51          
3   1 <NA> 2018-11-19 00:21:15          
4   1 <NA> 2018-11-19 00:31:02   na_ends
5   1    x 2018-11-19 00:40:59          
6   1    x 2018-11-19 00:50:49          
7   1    x 2018-11-19 01:01:15          
8   1 <NA> 2018-11-19 01:11:07 na_starts
9   1 <NA> 2018-11-19 01:20:49          
10  2 <NA> 2018-11-19 01:30:50 na_starts
11  2 <NA> 2018-11-19 01:40:43   na_ends
12  2    x 2018-11-19 01:50:46          
13  2    x 2018-11-19 02:01:02          
14  2    x 2018-11-19 02:10:44          
15  2 <NA> 2018-11-19 02:20:51 na_starts
16  2 <NA> 2018-11-19 02:31:06          
17  2 <NA> 2018-11-19 02:40:42          
18  2 <NA> 2018-11-19 02:50:45          
19  3 <NA> 2018-11-19 03:01:00 na_starts
20  3 <NA> 2018-11-19 03:10:42          
21  3 <NA> 2018-11-19 03:21:10          
22  3 <NA> 2018-11-19 03:31:10   na_ends
23  3    x 2018-11-19 03:40:44          
24  3 <NA> 2018-11-19 03:50:46      both
25  3    x 2018-11-19 04:00:46   

How would one achieve this with rle or a similar function in R?

 dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA, 
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA, 
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648, 
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275, 
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862, 
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460, 
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-25L), class = "data.frame")

Maybe this will work? I'm not entirely sure what relationship time has to the problem, other than that I think you wanted the data sorted by id and time.

library("tidyverse")
test = test %>% 
  arrange(id, time) %>% 
  mutate(miss = is.na(foo))

# This will make the index column for a single run
mark_ends = function(n, miss){
  if(!miss){
    rep("", times = n)
  } else if(n == 1){
    "both"
  } else {
    c("na_starts", rep("", times = n - 2), "na_ends")
  }
}

# This will use mark_ends across a single ID
mark_index = function(id){
  runs = test$miss[test$id == id] %>% 
    rle
  result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>% 
    reduce(.f = c)
  # the last row of an id never gets a marker
  result[length(result)] = ""
  result
}

# use the function on each id, combine, and put it in test
test$index = unique(test$id) %>% 
  map(mark_index) %>% 
  reduce(.f = c)
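
For readers unfamiliar with rle, here is a minimal base R sketch of what mark_index receives for a single id. The vector miss below is made up for illustration (it mirrors id 3 in the question's data) and is not part of the answer's code.

```r
# Made-up logical vector: TRUE where foo is NA (mirrors id 3 in the question)
miss <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

runs <- rle(miss)
runs$lengths  # 4 1 1 1
runs$values   # TRUE FALSE TRUE FALSE

# Each (length, value) pair is one run; inverse.rle() rebuilds the input
identical(inverse.rle(runs), miss)  # TRUE
```

Map then calls mark_ends once per run, pairing each length with its value.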

Using tidyverse and data.table you can do:

library(tidyverse)
library(data.table)  # for rleid()

# df is the data frame from the question (called test there)
df %>%
 rowid_to_column() %>%
 group_by(id, temp = rleid(foo)) %>%
 mutate(temp2 = seq_along(temp),
        index = ifelse(is.na(foo) & temp2 == min(temp2) & temp2 == max(temp2), "both", 
                       ifelse(is.na(foo) & temp2 == min(temp2), "na_starts", 
                              ifelse(is.na(foo) & temp2 == max(temp2), "na_ends", NA)))) %>%
 group_by(id) %>%
 # blank the last NA row of each id when the row before it is also NA
 mutate(index = ifelse(rowid == max(rowid[is.na(foo)]) & 
                         is.na(lag(foo)), NA, index)) %>%
 select(-temp, -temp2, -rowid)

      id foo   time                index    
   <dbl> <chr> <dttm>              <chr>    
 1    1. <NA>  2018-11-19 00:00:48 na_starts
 2    1. <NA>  2018-11-19 00:10:51 <NA>     
 3    1. <NA>  2018-11-19 00:21:15 <NA>     
 4    1. <NA>  2018-11-19 00:31:02 na_ends  
 5    1. x     2018-11-19 00:40:59 <NA>     
 6    1. x     2018-11-19 00:50:49 <NA>     
 7    1. x     2018-11-19 01:01:15 <NA>     
 8    1. <NA>  2018-11-19 01:11:07 na_starts
 9    1. <NA>  2018-11-19 01:20:49 <NA>     
10    2. <NA>  2018-11-19 01:30:50 na_starts
11    2. <NA>  2018-11-19 01:40:43 na_ends  
12    2. x     2018-11-19 01:50:46 <NA>     
13    2. x     2018-11-19 02:01:02 <NA>     
14    2. x     2018-11-19 02:10:44 <NA>     
15    2. <NA>  2018-11-19 02:20:51 na_starts
16    2. <NA>  2018-11-19 02:31:06 <NA>     
17    2. <NA>  2018-11-19 02:40:42 <NA>     
18    2. <NA>  2018-11-19 02:50:45 <NA>     
19    3. <NA>  2018-11-19 03:01:00 na_starts
20    3. <NA>  2018-11-19 03:10:42 <NA>     
21    3. <NA>  2018-11-19 03:21:10 <NA>     
22    3. <NA>  2018-11-19 03:31:10 na_ends  
23    3. x     2018-11-19 03:40:44 <NA>     
24    3. <NA>  2018-11-19 03:50:46 both     
25    3. x     2018-11-19 04:00:46 <NA> 

First, it creates a unique row ID. Second, it groups by "id" and by the run ID of "foo" (rleid gives consecutive identical values the same ID). Third, it numbers the rows within each run. Fourth, it creates the "index" variable from the given conditions: a run of one NA row is "both", the first row of a longer NA run is "na_starts", and its last row is "na_ends". Then it groups by "id" and sets "index" to NA on the last NA row of each id when the row before it is also NA. Finally, it removes the helper variables.
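
The same steps can also be sketched in base R with rle alone, which may make the logic easier to follow. This is only an illustration, not the answer's code: mark_na_runs is a hypothetical helper name, foo3 is a made-up vector mirroring id 3, and the sketch assumes the rows are already sorted by id and time.

```r
# Hypothetical helper (not from the answers): label NA runs in one id's foo vector.
mark_na_runs <- function(foo) {
  miss    <- is.na(foo)
  r       <- rle(miss)
  run_len <- rep(r$lengths, r$lengths)  # length of the run each row belongs to
  pos     <- sequence(r$lengths)        # position of each row within its run

  index <- rep(NA_character_, length(foo))
  index[miss & pos == 1]                     <- "na_starts"
  index[miss & pos == run_len & run_len > 1] <- "na_ends"
  index[miss & run_len == 1]                 <- "both"
  index[length(index)] <- NA_character_      # the last row of an id is never marked
  index
}

# Made-up vector mirroring id 3 in the question
foo3 <- c(NA, NA, NA, NA, "x", NA, "x")
mark_na_runs(foo3)
# "na_starts" NA NA "na_ends" NA "both" NA

# Applied per id on a data frame sorted by id and time (assumption):
# test$index <- ave(test$foo, test$id, FUN = mark_na_runs)
```

Unlike the pipeline above, this sketch blanks the last row of each id unconditionally, which gives the same result on the question's data.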

A possible solution using data.table:

library(data.table)
setDT(test)

ind <- test[, .(ri = unique(.I[c(1,.N)][all(is.na(foo))]))
            , by = .(id, rl = rleid(is.na(foo)))
            ][, index := list("both",c("na_starts","na_ends"))[[1 + (.N > 1)]]
              , by = .(id, rl)][]

test[ind$ri, index := ind$index
     ][test[, .I[.N], by = id]$V1, index := NA][]

which gives:

> test
    id  foo                time     index
 1:  1 <NA> 2018-11-19 00:00:48 na_starts
 2:  1 <NA> 2018-11-19 00:10:51      <NA>
 3:  1 <NA> 2018-11-19 00:21:15      <NA>
 4:  1 <NA> 2018-11-19 00:31:02   na_ends
 5:  1    x 2018-11-19 00:40:59      <NA>
 6:  1    x 2018-11-19 00:50:49      <NA>
 7:  1    x 2018-11-19 01:01:15      <NA>
 8:  1 <NA> 2018-11-19 01:11:07 na_starts
 9:  1 <NA> 2018-11-19 01:20:49      <NA>
10:  2 <NA> 2018-11-19 01:30:50 na_starts
11:  2 <NA> 2018-11-19 01:40:43   na_ends
12:  2    x 2018-11-19 01:50:46      <NA>
13:  2    x 2018-11-19 02:01:02      <NA>
14:  2    x 2018-11-19 02:10:44      <NA>
15:  2 <NA> 2018-11-19 02:20:51 na_starts
16:  2 <NA> 2018-11-19 02:31:06      <NA>
17:  2 <NA> 2018-11-19 02:40:42      <NA>
18:  2 <NA> 2018-11-19 02:50:45      <NA>
19:  3 <NA> 2018-11-19 03:01:00 na_starts
20:  3 <NA> 2018-11-19 03:10:42      <NA>
21:  3 <NA> 2018-11-19 03:21:10      <NA>
22:  3 <NA> 2018-11-19 03:31:10   na_ends
23:  3    x 2018-11-19 03:40:44      <NA>
24:  3 <NA> 2018-11-19 03:50:46      both
25:  3    x 2018-11-19 04:00:46      <NA>
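
Two compact idioms in the data.table code can be tried in isolation in base R. The sketch below uses made-up row numbers; first_last and label_for are hypothetical names introduced only for this illustration and do not appear in the answer.

```r
# Subsetting by a single logical keeps everything (TRUE) or nothing (FALSE),
# which is how .I[c(1, .N)][all(is.na(foo))] drops the runs that are not NA.
first_last <- function(rows, all_na) unique(rows[c(1, length(rows))][all_na])

first_last(8:9, TRUE)    # 8 9
first_last(5:7, FALSE)   # integer(0)
first_last(24L, TRUE)    # 24 (unique() collapses the repeated single row)

# 1 + (n > 1) is 1 when one row was kept and 2 otherwise,
# so it selects the matching set of labels from the list.
label_for <- function(n) list("both", c("na_starts", "na_ends"))[[1 + (n > 1)]]

label_for(1)  # "both"
label_for(2)  # "na_starts" "na_ends"
```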
