Min and max of NA sequences
I have a data frame in which the column foo contains running sequences of NA values. For example:
> test
id foo time
1 1 <NA> 2018-11-19 00:00:48
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50
11 2 <NA> 2018-11-19 01:40:43
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00
20 3 <NA> 2018-11-19 03:10:42
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46
25 3 x 2018-11-19 04:00:46
My objective is to mark where each NA sequence begins and ends, by id and time. For example, the above dataset would have an extra column called index which marks where the starts and ends of these NA runs are. However, the last NA in each id series should be ignored, and a single NA value should be marked as "both". For example:
> test
id foo time index
1 1 <NA> 2018-11-19 00:00:48 na_starts
2 1 <NA> 2018-11-19 00:10:51
3 1 <NA> 2018-11-19 00:21:15
4 1 <NA> 2018-11-19 00:31:02 na_ends
5 1 x 2018-11-19 00:40:59
6 1 x 2018-11-19 00:50:49
7 1 x 2018-11-19 01:01:15
8 1 <NA> 2018-11-19 01:11:07 na_starts
9 1 <NA> 2018-11-19 01:20:49
10 2 <NA> 2018-11-19 01:30:50 na_starts
11 2 <NA> 2018-11-19 01:40:43 na_ends
12 2 x 2018-11-19 01:50:46
13 2 x 2018-11-19 02:01:02
14 2 x 2018-11-19 02:10:44
15 2 <NA> 2018-11-19 02:20:51 na_starts
16 2 <NA> 2018-11-19 02:31:06
17 2 <NA> 2018-11-19 02:40:42
18 2 <NA> 2018-11-19 02:50:45
19 3 <NA> 2018-11-19 03:01:00 na_starts
20 3 <NA> 2018-11-19 03:10:42
21 3 <NA> 2018-11-19 03:21:10
22 3 <NA> 2018-11-19 03:31:10 na_ends
23 3 x 2018-11-19 03:40:44
24 3 <NA> 2018-11-19 03:50:46 both
25 3 x 2018-11-19 04:00:46
How would one achieve this with rle or a similar function in R?
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), foo = c(NA, NA, NA, NA,
"x", "x", "x", NA, NA, NA, NA, "x", "x", "x", NA, NA, NA, NA,
NA, NA, NA, NA, "x", NA, "x"), time = structure(c(1542585648,
1542586251, 1542586875, 1542587462, 1542588059, 1542588649, 1542589275,
1542589867, 1542590449, 1542591050, 1542591643, 1542592246, 1542592862,
1542593444, 1542594051, 1542594666, 1542595242, 1542595845, 1542596460,
1542597042, 1542597670, 1542598270, 1542598844, 1542599446, 1542600046
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-25L), class = "data.frame")
Maybe this will work? I'm not entirely sure what relationship time has to the problem, other than that I think you wanted the data sorted by id and time.
library("tidyverse")
test = test %>%
  arrange(id, time) %>%
  mutate(miss = is.na(foo))
# This will make the index column for a single run
mark_ends = function(n, miss){
  if(!miss){
    rep("", times = n)
  } else {
    if(n == 1){
      "both"
    } else {
      c("na_starts", rep("", times = (n-2)), "na_ends")
    }
  }
}
# This will use mark_ends across a single ID
mark_index = function(id){
  runs = test$miss[test$id == id] %>%
    rle
  result = Map(f = mark_ends, n = runs$lengths, miss = runs$values) %>%
    reduce(.f = c)
  # the last row of each id is never marked
  result[length(result)] = ""
  result
}
# use the function on each id, combine, and put it in test
test$index = unique(test$id) %>%
  map(mark_index) %>%
  reduce(.f = c)
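To see what mark_index has to work with, here is a minimal base R sketch of what rle returns for the missingness pattern of id 1 in the question's data. The vector miss is typed in by hand here as an illustration; it is not taken from the answer's code.

```r
# Missingness of foo for id 1 (rows 1 to 9 of the question's data),
# entered by hand for illustration
miss <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# rle() collapses consecutive equal values into runs
runs <- rle(miss)

runs$lengths  # 4 3 2
runs$values   # TRUE FALSE TRUE
```

Each (length, value) pair is one call to mark_ends, so id 1 yields a run of four NAs, a run of three non-missing values, and a final run of two NAs.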
Using tidyverse and data.table you can do:
library(tidyverse)
library(data.table)  # for rleid()
# df is the data frame from the question (called test there)
df %>%
  rowid_to_column() %>%
  group_by(id, temp = rleid(foo)) %>%
  mutate(temp2 = seq_along(temp),
         index = ifelse(is.na(foo) & temp2 == min(temp2) & temp2 == max(temp2), "both",
                 ifelse(is.na(foo) & temp2 == min(temp2), "na_starts",
                 ifelse(is.na(foo) & temp2 == max(temp2), "na_ends", NA)))) %>%
  group_by(id) %>%
  mutate(index = ifelse(rowid == max(rowid[is.na(foo) & max(temp) & max(temp2)]) &
                          is.na(lag(foo)), NA, index)) %>%
  select(-temp, -temp2, -rowid)
id foo time index
<dbl> <chr> <dttm> <chr>
1 1. <NA> 2018-11-19 00:00:48 na_starts
2 1. <NA> 2018-11-19 00:10:51 <NA>
3 1. <NA> 2018-11-19 00:21:15 <NA>
4 1. <NA> 2018-11-19 00:31:02 na_ends
5 1. x 2018-11-19 00:40:59 <NA>
6 1. x 2018-11-19 00:50:49 <NA>
7 1. x 2018-11-19 01:01:15 <NA>
8 1. <NA> 2018-11-19 01:11:07 na_starts
9 1. <NA> 2018-11-19 01:20:49 <NA>
10 2. <NA> 2018-11-19 01:30:50 na_starts
11 2. <NA> 2018-11-19 01:40:43 na_ends
12 2. x 2018-11-19 01:50:46 <NA>
13 2. x 2018-11-19 02:01:02 <NA>
14 2. x 2018-11-19 02:10:44 <NA>
15 2. <NA> 2018-11-19 02:20:51 na_starts
16 2. <NA> 2018-11-19 02:31:06 <NA>
17 2. <NA> 2018-11-19 02:40:42 <NA>
18 2. <NA> 2018-11-19 02:50:45 <NA>
19 3. <NA> 2018-11-19 03:01:00 na_starts
20 3. <NA> 2018-11-19 03:10:42 <NA>
21 3. <NA> 2018-11-19 03:21:10 <NA>
22 3. <NA> 2018-11-19 03:31:10 na_ends
23 3. x 2018-11-19 03:40:44 <NA>
24 3. <NA> 2018-11-19 03:50:46 both
25 3. x 2018-11-19 04:00:46 <NA>
First, it creates a unique row ID. Second, it groups by "id" and by the run-length ID of "foo". Third, it numbers the rows within each run of "foo". Fourth, it creates the "index" variable using the given conditions. Then, it groups by "id" and assigns NA to the last row of a missing "foo" sequence per id. Finally, it removes the redundant variables.
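The run ID and within-run position described in these steps can be sketched in base R for a single id. This is only an illustration under some assumptions: it uses the foo values of id 3, it builds the run ID from rle(is.na(foo)) rather than from data.table's rleid(foo), and the names run_id, pos and run_len are made up here. It also leaves out the final step that blanks the last NA of each id.

```r
# foo values for id 3 in the question's data, entered by hand
foo <- c(NA, NA, NA, NA, "x", NA, "x")
miss <- is.na(foo)

# rle() on foo itself would treat every NA as its own run,
# so the runs are computed on the missingness flag instead
r <- rle(miss)

run_id  <- rep(seq_along(r$lengths), times = r$lengths)  # which run a row belongs to
pos     <- sequence(r$lengths)                           # position inside the run
run_len <- rep(r$lengths, times = r$lengths)             # length of the row's run

# Label single-row NA runs first, then the first and last row of longer runs
index <- ifelse(!miss, NA_character_,
         ifelse(run_len == 1, "both",
         ifelse(pos == 1, "na_starts",
         ifelse(pos == run_len, "na_ends", NA_character_))))

index
```

For this input, run_id is 1 1 1 1 2 3 4 and pos is 1 2 3 4 1 1 1, so row 6 is a run of length one and gets "both".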
A possible solution using data.table:
library(data.table)
setDT(test)
ind <- test[, .(ri = unique(.I[c(1,.N)][all(is.na(foo))]))
            , by = .(id, rl = rleid(is.na(foo)))
            ][, index := list("both",c("na_starts","na_ends"))[[1 + (.N > 1)]]
              , by = .(id, rl)][]
test[ind$ri, index := ind$index
     ][test[, .I[.N], by = id]$V1, index := NA][]
which gives:
> test
id foo time index
1: 1 <NA> 2018-11-19 00:00:48 na_starts
2: 1 <NA> 2018-11-19 00:10:51 <NA>
3: 1 <NA> 2018-11-19 00:21:15 <NA>
4: 1 <NA> 2018-11-19 00:31:02 na_ends
5: 1 x 2018-11-19 00:40:59 <NA>
6: 1 x 2018-11-19 00:50:49 <NA>
7: 1 x 2018-11-19 01:01:15 <NA>
8: 1 <NA> 2018-11-19 01:11:07 na_starts
9: 1 <NA> 2018-11-19 01:20:49 <NA>
10: 2 <NA> 2018-11-19 01:30:50 na_starts
11: 2 <NA> 2018-11-19 01:40:43 na_ends
12: 2 x 2018-11-19 01:50:46 <NA>
13: 2 x 2018-11-19 02:01:02 <NA>
14: 2 x 2018-11-19 02:10:44 <NA>
15: 2 <NA> 2018-11-19 02:20:51 na_starts
16: 2 <NA> 2018-11-19 02:31:06 <NA>
17: 2 <NA> 2018-11-19 02:40:42 <NA>
18: 2 <NA> 2018-11-19 02:50:45 <NA>
19: 3 <NA> 2018-11-19 03:01:00 na_starts
20: 3 <NA> 2018-11-19 03:10:42 <NA>
21: 3 <NA> 2018-11-19 03:21:10 <NA>
22: 3 <NA> 2018-11-19 03:31:10 na_ends
23: 3 x 2018-11-19 03:40:44 <NA>
24: 3 <NA> 2018-11-19 03:50:46 both
25: 3 x 2018-11-19 04:00:46 <NA>
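Two indexing idioms carry most of the data.table code: subsetting by a single logical value to keep or drop the run's endpoints, and picking an element of a list by position to choose the labels. A small base R sketch of both follows; the helper name pick_labels and the example row numbers are made up for illustration, with n standing in for .N and rows for .I.

```r
# Choose labels by run length: element 1 for a single row, element 2 otherwise
pick_labels <- function(n) {
  list("both", c("na_starts", "na_ends"))[[1 + (n > 1)]]
}
one_row  <- pick_labels(1)  # "both"
many_row <- pick_labels(4)  # "na_starts" "na_ends"

# First and last row number of a run, kept only if the run is all NA.
# Indexing with a single TRUE keeps everything; a single FALSE keeps nothing.
rows <- 8:9
kept    <- unique(rows[c(1, length(rows))][TRUE])   # 8 9
dropped <- unique(rows[c(1, length(rows))][FALSE])  # integer(0)

# For a one-row run, first and last coincide, so unique() leaves one row
single <- unique(24L[c(1, 1)][TRUE])                # 24
```

This is why the number of labels always matches the number of row indices per run: one of each for a single NA, two of each for a longer run, and none for runs of non-missing values.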