Replace NA when last and next non-NA values are equal

I have a sample table in which some, but not all, of the NA values need to be replaced.

> dat
   id message index
1   1    <NA>     1
2   1     foo     2
3   1     foo     3
4   1    <NA>     4
5   1     foo     5
6   1    <NA>     6
7   2    <NA>     1
8   2     baz     2
9   2    <NA>     3
10  2     baz     4
11  2     baz     5
12  2     baz     6
13  3     bar     1
14  3    <NA>     2
15  3    <NA>     3
16  3     bar     4
17  3    <NA>     5
18  3     bar     6
19  3    <NA>     7
20  3     qux     8

My objective is to replace the NA values that are surrounded by the same "message", using the first appearance of the message (the minimum index value) and the last appearance of the message (the maximum index value) within each id.

Sometimes the NA sequences are only of length 1; other times they can be very long. Regardless, all of the NAs that are "sandwiched" between the same value of "message" before and after them should be filled in.

The output for the above incomplete table would be:

 > output
   id message index
1   1    <NA>     1
2   1     foo     2
3   1     foo     3
4   1     foo     4
5   1     foo     5
6   1    <NA>     6
7   2    <NA>     1
8   2     baz     2
9   2     baz     3
10  2     baz     4
11  2     baz     5
12  2     baz     6
13  3     bar     1
14  3     bar     2
15  3     bar     3
16  3     bar     4
17  3     bar     5
18  3     bar     6
19  3    <NA>     7
20  3     qux     8

Any guidance using data.table or dplyr here would be helpful, as I'm not even sure where to begin.

The furthest I got was subsetting by unique messages, but this method does not take id into account:

#get distinct messages
messages = unique(dat$message)

#remove NA
messages = messages[!is.na(messages)]

#subset dat for each message
for (i in 1:length(messages)) {print(dat[dat$message == messages[i],]) }

The data:

 dput(dat)
structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 
3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo", 
NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", 
NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 
5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")

Perform an na.locf0 both forwards and backwards, and if the two results are the same then use the common value; otherwise, use NA. The grouping is done with ave.

library(zoo)

filler <- function(x) {
  forward <- na.locf0(x)
  backward <- na.locf0(x, fromLast = TRUE)
  ifelse(forward == backward, forward, NA)
}
transform(dat, message = ave(message, id, FUN = filler))

giving:

   id message index
1   1    <NA>     1
2   1     foo     2
3   1     foo     3
4   1     foo     4
5   1     foo     5
6   1    <NA>     6
7   2    <NA>     1
8   2     baz     2
9   2     baz     3
10  2     baz     4
11  2     baz     5
12  2     baz     6
13  3     bar     1
14  3     bar     2
15  3     bar     3
16  3     bar     4
17  3     bar     5
18  3     bar     6
19  3    <NA>     7
20  3     qux     8
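
If zoo is not available, the same forward-and-backward comparison can be sketched in base R. The helper names locf_base and fill_sandwiched below are made up for this illustration and are not part of any package; the sketch also assumes the rows are already ordered by index within each id, as they are in dat.

```r
# Sample data from the question
dat <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3),
  message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz",
              "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"),
  index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last observation carried forward, leading NAs kept
locf_base <- function(x) {
  seen <- cumsum(!is.na(x))       # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]   # slot 1 is the NA used before any value
}

# Hypothetical helper: keep a filled value only where both directions agree
fill_sandwiched <- function(x) {
  forward  <- locf_base(x)
  backward <- rev(locf_base(rev(x)))
  agree <- !is.na(forward) & !is.na(backward) & forward == backward
  forward[!agree] <- NA
  forward
}

# Apply the helper separately within each id
filled <- ave(dat$message, dat$id, FUN = fill_sandwiched)
```

Here filled can be assigned back to dat$message; the explicit is.na checks make the "no agreement" case FALSE rather than NA, so the final assignment never sees a missing condition.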

An option that uses na.approx from zoo.

First, we extract the unique elements from column message that are not NA and find their positions in dat$message:

x <- unique(na.omit(dat$message))
(y <- match(dat$message, x))
# [1] NA  1  1 NA  1 NA NA  2 NA  2  2  2  3 NA NA  3 NA  3 NA  4

library(zoo)
library(dplyr)
out <- do.call(coalesce, 
               lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE))))
dat$new <- x[out]
dat
#    id message index  new
#1   1    <NA>     1 <NA>
#2   1     foo     2  foo
#3   1     foo     3  foo
#4   1    <NA>     4  foo
#5   1     foo     5  foo
#6   1    <NA>     6 <NA>
#7   2    <NA>     1 <NA>
#8   2     baz     2  baz
#9   2    <NA>     3  baz
#10  2     baz     4  baz
#11  2     baz     5  baz
#12  2     baz     6  baz
#13  3     bar     1  bar
#14  3    <NA>     2  bar
#15  3    <NA>     3  bar
#16  3     bar     4  bar
#17  3    <NA>     5  bar
#18  3     bar     6  bar
#19  3    <NA>     7 <NA>
#20  3     qux     8  qux

tl;dr

When we call

match(y, 1) * 1
# [1] NA  1  1 NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

we get the elements only where there are 1s in y. Accordingly, when we do

match(y, 2) * 2
# [1] NA NA NA NA NA NA NA  2 NA  2  2  2 NA NA NA NA NA NA NA NA

the result is the same for the 2s.

Think of 1 and 2 as the first and second elements in

x
# [1] "foo" "baz" "bar" "qux"

that is, "foo" and "baz".

Now for each match(y, i) * i we can call na.approx from zoo to fill the NAs that are in between (i will later run over seq_along(x)).

na.approx(match(y, 2) * 2, na.rm = FALSE)
# [1] NA NA NA NA NA NA NA  2  2  2  2  2 NA NA NA NA NA NA NA NA
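
The reason this fills only the gaps between equal values is that na.approx interpolates linearly between the neighbouring non-NA points and, with na.rm = FALSE, leaves leading and trailing NAs alone. Interpolating between 2 and 2 can only give 2. A rough base-R sketch of the same idea with stats::approx follows; the variable names are illustrative, and this is not zoo's actual implementation.

```r
# The vector produced by match(y, 2) * 2 in the answer above
v <- c(NA, NA, NA, NA, NA, NA, NA, 2, NA, 2, 2, 2, NA, NA, NA, NA, NA, NA, NA, NA)

# Positions that hold a known value
known <- which(!is.na(v))

# Linear interpolation at every position; rule = 1 (the default) returns NA
# for positions outside the range of the known ones
interpolated <- approx(x = known, y = v[known], xout = seq_along(v), rule = 1)$y
```

Only position 9, which lies between two known 2s, changes; everything before position 8 and after position 12 stays NA.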

We do the same for each element in seq_along(x), that is 1:4, using lapply. The result is a list:

lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE)))
#[[1]]
# [1] NA  1  1  1  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#
#[[2]]
# [1] NA NA NA NA NA NA NA  2  2  2  2  2 NA NA NA NA NA NA NA NA
#
#[[3]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA  3  3  3  3  3  3 NA NA
#
#[[4]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  4

(as.double was needed here because otherwise coalesce would complain that "Argument 4 must be type double, not integer".)

We are almost there. What we need to do next is find the first non-missing value at each position. This is where coalesce from dplyr comes into play, and the result is

out <- do.call(coalesce, 
               lapply(seq_along(x), function(i) as.integer(na.approx(match(y, i) * i, na.rm = FALSE))))
out
# [1] NA  1  1  1  1 NA NA  2  2  2  2  2  3  3  3  3  3  3 NA  4

We can use this vector to extract the desired values from x:

x[out]
# [1] NA    "foo" "foo" "foo" "foo" NA    NA    "baz" "baz" "baz" "baz" "baz" "bar" "bar" "bar" "bar" "bar" "bar" NA    "qux"

Hope this helps.

Here's an approach without grouping: fill the values, then replace them back with NA where they were filled incorrectly.

tidyr::fill by default fills missing values with the previous value, so it will overfill some values. Applied to ungrouped data, as here, it also fills across id boundaries, so we use an if_else condition to undo the incorrect fills.

First, we capture the original missing value locations and calculate the max and min index for each id and message. After filling, we join on these index boundaries. If there is no match, then the id changed; if there is a match, either it was a correct replacement or the index is outside the boundaries. So we check the locations that originally had missing values for these conditions and replace them back with NA if they are met.

EDIT: this can break on other input; attempting to fix.

library(tidyverse)
dat <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")

indices <- dat %>%
  group_by(id, message) %>%
  summarise(min = min(index), max = max(index)) %>%
  drop_na

dat %>%
  mutate(orig_na = is.na(message)) %>%
  fill(message) %>%
  left_join(indices, by = c("id", "message")) %>% 
  mutate(
    message = if_else(
      condition = orig_na &
        (index < min | index > max | is.na(min)),
      true = NA_character_,
      false = message
    )
  )
#>    id message index orig_na min max
#> 1   1    <NA>     1    TRUE  NA  NA
#> 2   1     foo     2   FALSE   2   5
#> 3   1     foo     3   FALSE   2   5
#> 4   1     foo     4    TRUE   2   5
#> 5   1     foo     5   FALSE   2   5
#> 6   1    <NA>     6    TRUE   2   5
#> 7   2    <NA>     1    TRUE  NA  NA
#> 8   2     baz     2   FALSE   2   6
#> 9   2     baz     3    TRUE   2   6
#> 10  2     baz     4   FALSE   2   6
#> 11  2     baz     5   FALSE   2   6
#> 12  2     baz     6   FALSE   2   6
#> 13  3     bar     1   FALSE   1   6
#> 14  3     bar     2    TRUE   1   6
#> 15  3     bar     3    TRUE   1   6
#> 16  3     bar     4   FALSE   1   6
#> 17  3     bar     5    TRUE   1   6
#> 18  3     bar     6   FALSE   1   6
#> 19  3    <NA>     7    TRUE   1   6
#> 20  3     qux     8   FALSE   8   8

Created on 2019-02-15 by the reprex package (v0.2.1)
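
To make the join step easier to follow, here is a base-R sketch of a boundary table that plays the same role as indices above. It assumes the dat from the question; the column names index_min and index_max come from the suffixes chosen here, not from the original answer.

```r
# Sample data from the question
dat <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3),
  message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz",
              "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"),
  index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# aggregate() with a formula drops rows where message is NA by default
mins <- aggregate(index ~ id + message, data = dat, FUN = min)
maxs <- aggregate(index ~ id + message, data = dat, FUN = max)

# One row per (id, message) with the first and last index of that message
bounds <- merge(mins, maxs, by = c("id", "message"),
                suffixes = c("_min", "_max"))
bounds <- bounds[order(bounds$id, bounds$message), ]
```

For the sample data this gives foo from 2 to 5 in id 1, baz from 2 to 6 in id 2, and bar from 1 to 6 and qux from 8 to 8 in id 3, which matches the min and max columns in the output above.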

Another tidyverse solution, using case_when. Edited to avoid filling after the end of the series.

library(dplyr)

dfr <- data.frame(
  index =  c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
  message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"),
  id =  c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
)

dfrFilled <- dfr %>% 
  group_by(id) %>% 
  mutate(
    endSeries = max( # identify end of series
      index[message == na.omit(message)[1]],
      na.rm = TRUE
      ),
    filledValues = case_when(
      min(index) == index ~ message,
      max(index) == index ~ message,
      index < endSeries ~ na.omit(message)[1], # fill if index is before end of series.
      TRUE ~ message
    )
  )

If you fill both ways and check for equality, that should work, as long as you account for grouping and index:

tidyverse:

library(tidyverse)

dat %>%
  arrange(id, index) %>%
  mutate(msg_down = fill(group_by(., id), message, .direction = 'down')$message,
         msg_up   = fill(group_by(., id), message, .direction = 'up')$message,
         message = case_when(!is.na(message) ~ message,
                             msg_down == msg_up ~ msg_down,
                             TRUE ~ NA_character_)) %>%
  select(-msg_down, -msg_up)

   id message index
1   1    <NA>     1
2   1     foo     2
3   1     foo     3
4   1     foo     4
5   1     foo     5
6   1    <NA>     6
7   2    <NA>     1
8   2     baz     2
9   2     baz     3
10  2     baz     4
11  2     baz     5
12  2     baz     6
13  3     bar     1
14  3     bar     2
15  3     bar     3
16  3     bar     4
17  3     bar     5
18  3     bar     6
19  3    <NA>     7
20  3     qux     8

data.table:

library(data.table)
library(zoo)

setDT(dat)[order(index),
           message := ifelse(na.locf(message, na.rm = FALSE) == na.locf(message, na.rm = FALSE, fromLast = TRUE),
                             na.locf(message, na.rm = FALSE),
                             NA),
           by = "id"][]

    id message index
 1:  1    <NA>     1
 2:  1     foo     2
 3:  1     foo     3
 4:  1     foo     4
 5:  1     foo     5
 6:  1    <NA>     6
 7:  2    <NA>     1
 8:  2     baz     2
 9:  2     baz     3
10:  2     baz     4
11:  2     baz     5
12:  2     baz     6
13:  3     bar     1
14:  3     bar     2
15:  3     bar     3
16:  3     bar     4
17:  3     bar     5
18:  3     bar     6
19:  3    <NA>     7
20:  3     qux     8
