Replace NA when last and next non-NA values are equal
I have a sample table in which some, but not all, of the NA values need to be replaced.
> dat
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 <NA> 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 <NA> 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 <NA> 2
15 3 <NA> 3
16 3 bar 4
17 3 <NA> 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
My objective is to replace, by id, the NA values that are surrounded by the same "message": those lying between the first appearance of that message (the minimum index value) and its last appearance (the maximum index value).
Sometimes the NA sequences are only of length 1; other times they can be very long. Regardless, all of the NAs that are "sandwiched" between the same value of "message" before and after them should be filled in.
The output for the above incomplete table would be:
> output
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
Any guidance using data.table or dplyr here would be helpful, as I'm not even sure where to begin.
The furthest I got was subsetting by unique messages, but this method does not take id into account:
#get distinct messages
messages = unique(dat$message)
#remove NA
messages = messages[!is.na(messages)]
#subset dat for each message
for (i in seq_along(messages)) {
  print(dat[dat$message == messages[i], ])
}
The data:
dput(dat)
structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo",
NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar",
NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4,
5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")
Perform an na.locf0 both forwards and backwards, and if the two results are the same, use the common value; otherwise, use NA. The grouping is done with ave.
library(zoo)
filler <- function(x) {
forward <- na.locf0(x)
backward <- na.locf0(x, fromLast = TRUE)
ifelse(forward == backward, forward, NA)
}
transform(dat, message = ave(message, id, FUN = filler))
giving:
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
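For readers without zoo installed, the same forward/backward comparison can be sketched in base R. This is only an illustration: locf and fill_between are hypothetical helper names (not zoo functions), and the small dat below is made-up sample data rather than the table from the question.

```r
# Hypothetical helper (not from zoo): carry the last non-NA value forward.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1L]   # slot 1 is the NA used before any value is seen
}

# Fill an NA only when the nearest non-NA values on both sides agree.
fill_between <- function(x) {
  forward  <- locf(x)
  backward <- rev(locf(rev(x)))
  ifelse(forward == backward, forward, NA)
}

# Made-up sample data: id 1 has a sandwiched NA, id 2 does not.
dat <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2, 2),
  message = c(NA, "foo", NA, "foo", "bar", NA, "qux", NA),
  stringsAsFactors = FALSE
)

# ave applies fill_between separately within each id.
dat$message <- ave(dat$message, dat$id, FUN = fill_between)
print(dat)
```

The leading NA in id 1 and the NA between "bar" and "qux" in id 2 stay NA, because the forward and backward fills disagree or one of them is missing there.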
An option that uses na.approx from zoo.
First, we extract the unique elements of the column message that are not NA and find their positions in dat$message:
x <- unique(na.omit(dat$message))
(y <- match(dat$message, x))
# [1] NA 1 1 NA 1 NA NA 2 NA 2 2 2 3 NA NA 3 NA 3 NA 4
library(zoo)
library(dplyr)
out <- do.call(coalesce,
lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE))))
dat$new <- x[out]
dat
# id message index new
#1 1 <NA> 1 <NA>
#2 1 foo 2 foo
#3 1 foo 3 foo
#4 1 <NA> 4 foo
#5 1 foo 5 foo
#6 1 <NA> 6 <NA>
#7 2 <NA> 1 <NA>
#8 2 baz 2 baz
#9 2 <NA> 3 baz
#10 2 baz 4 baz
#11 2 baz 5 baz
#12 2 baz 6 baz
#13 3 bar 1 bar
#14 3 <NA> 2 bar
#15 3 <NA> 3 bar
#16 3 bar 4 bar
#17 3 <NA> 5 bar
#18 3 bar 6 bar
#19 3 <NA> 7 <NA>
#20 3 qux 8 qux
tl;dr
When we call
match(y, 1) * 1
# [1] NA 1 1 NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
we get the elements only where there are 1s in y. Accordingly, when we do
match(y, 2) * 2
# [1] NA NA NA NA NA NA NA 2 NA 2 2 2 NA NA NA NA NA NA NA NA
the result is the same for the 2s.
Think of 1 and 2 as the first and second elements in
x
# [1] "foo" "baz" "bar" "qux"
that is, "foo" and "baz".
Now for each match(y, i) * i we can call na.approx from zoo to fill the NAs that are in between (i will become seq_along(x) later).
na.approx(match(y, 2) * 2, na.rm = FALSE)
# [1] NA NA NA NA NA NA NA 2 2 2 2 2 NA NA NA NA NA NA NA NA
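To see why interpolation fills only the gaps between equal values, here is a rough base R sketch using stats::approx, which, as far as I know, is what na.approx relies on for linear interpolation. The vector v is made up for illustration.

```r
# Made-up vector: 2s with gaps, plus leading and trailing NAs.
v <- c(NA, NA, 2, NA, 2, 2, NA)
known <- !is.na(v)

# Linear interpolation at every position, using only the known points.
# With approx's default rule = 1, positions outside the known range stay NA.
filled <- approx(x = which(known), y = v[known], xout = seq_along(v))$y
print(filled)
```

Interpolating between two equal values gives that same value, so only the NA at position 4 is filled. The leading and trailing NAs lie outside the range of known points and are left alone. Note that this sketch, like the na.approx code above, does not look at id.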
We do the same for each element in seq_along(x), that is 1:4, using lapply. The result is a list
lapply(seq_along(x), function(i) as.double(na.approx(match(y, i) * i, na.rm = FALSE)))
#[[1]]
# [1] NA 1 1 1 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#
#[[2]]
# [1] NA NA NA NA NA NA NA 2 2 2 2 2 NA NA NA NA NA NA NA NA
#
#[[3]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA 3 3 3 3 3 3 NA NA
#
#[[4]]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 4
(as.double was needed here because otherwise coalesce would complain that "Argument 4 must be type double, not integer")
We are almost there. What we need to do next is to find the first non-missing value at each position; this is where coalesce from dplyr comes into play, and the result is
out <- do.call(coalesce,
lapply(seq_along(x), function(i) as.integer(na.approx(match(y, i) * i, na.rm = FALSE))))
out
# [1] NA 1 1 1 1 NA NA 2 2 2 2 2 3 3 3 3 3 3 NA 4
We can use this vector to extract the desired values from x as
x[out]
# [1] NA "foo" "foo" "foo" "foo" NA NA "baz" "baz" "baz" "baz" "baz" "bar" "bar" "bar" "bar" "bar" "bar" NA "qux"
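If dplyr is not available, the "first non-missing value at each position" step can be approximated in base R with Reduce. This is a sketch under assumptions: first_non_na is a hypothetical helper, not dplyr's coalesce, and the shortened x and pieces are made-up stand-ins for the per-message vectors above.

```r
# Hypothetical two-argument stand-in for coalesce: keep a unless it is NA.
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

x <- c("foo", "baz")
pieces <- list(
  c(NA, 1, 1, NA, NA),   # positions belonging to x[1]
  c(NA, NA, NA, 2, NA)   # positions belonging to x[2]
)

# Fold the list pairwise, then use the result to index into x.
out <- Reduce(first_non_na, pieces)
print(x[out])
```

Indexing x with NA yields NA, so positions that no message claims stay missing.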
Hope this helps.
Here's an approach without grouping: fill the values, and then replace them back with NA if they were filled incorrectly.
tidyr::fill by default fills missing values with the previous value, so it will overfill some values. Applied to ungrouped data as it is here, it also fills across id boundaries, so we have to use an if_else condition to fix its errors.
First, we capture the original missing value locations and calculate the max and min index for each id and message. After filling, we join on these index boundaries. If there is no match, then the id changed; if there is a match, either it was a correct replacement or the index is outside the boundaries. So we check the locations with original missing values for these conditions and replace back with NA if they are met.
EDIT: this can break on other input; attempting to fix.
library(tidyverse)
dat <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3), message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"), index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, -20L), class = "data.frame")
indices <- dat %>%
group_by(id, message) %>%
summarise(min = min(index), max = max(index)) %>%
drop_na
dat %>%
mutate(orig_na = is.na(message)) %>%
fill(message) %>%
left_join(indices, by = c("id", "message")) %>%
mutate(
message = if_else(
condition = orig_na &
(index < min | index > max | is.na(min)),
true = NA_character_,
false = message
)
)
#> id message index orig_na min max
#> 1 1 <NA> 1 TRUE NA NA
#> 2 1 foo 2 FALSE 2 5
#> 3 1 foo 3 FALSE 2 5
#> 4 1 foo 4 TRUE 2 5
#> 5 1 foo 5 FALSE 2 5
#> 6 1 <NA> 6 TRUE 2 5
#> 7 2 <NA> 1 TRUE NA NA
#> 8 2 baz 2 FALSE 2 6
#> 9 2 baz 3 TRUE 2 6
#> 10 2 baz 4 FALSE 2 6
#> 11 2 baz 5 FALSE 2 6
#> 12 2 baz 6 FALSE 2 6
#> 13 3 bar 1 FALSE 1 6
#> 14 3 bar 2 TRUE 1 6
#> 15 3 bar 3 TRUE 1 6
#> 16 3 bar 4 FALSE 1 6
#> 17 3 bar 5 TRUE 1 6
#> 18 3 bar 6 FALSE 1 6
#> 19 3 <NA> 7 TRUE 1 6
#> 20 3 qux 8 FALSE 8 8
Created on 2019-02-15 by the reprex package (v0.2.1)
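The index-boundary idea can also be sketched in base R without fill. This is a different take from the pipeline above, not a rewrite of it. The names known, bounds and filled are illustrative, and dat is a small made-up table.

```r
# Made-up sample data: id 1 has a sandwiched NA, id 2 does not.
dat <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2),
  message = c(NA, "foo", NA, "foo", "bar", NA, "qux"),
  index   = c(1, 2, 3, 4, 1, 2, 3),
  stringsAsFactors = FALSE
)

# First and last index at which each message appears within each id.
known  <- dat[!is.na(dat$message), ]
lo     <- aggregate(index ~ id + message, data = known, FUN = min)
hi     <- aggregate(index ~ id + message, data = known, FUN = max)
bounds <- merge(lo, hi, by = c("id", "message"), suffixes = c("_min", "_max"))

# Fill an NA row only if exactly one message range in the same id contains it.
filled <- dat$message
for (r in which(is.na(dat$message))) {
  hit <- bounds$id == dat$id[r] &
    bounds$index_min < dat$index[r] &
    dat$index[r] < bounds$index_max
  if (sum(hit) == 1) filled[r] <- as.character(bounds$message[hit])
}
print(filled)
```

As with the pipeline above, boundaries alone can mislead. With a sequence such as foo, NA, bar, NA, foo, the first NA lies inside the foo range even though its neighbours are foo and bar, so this sketch would fill it. The fill-both-ways comparison does not have that problem.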
Another tidyverse solution using case_when. Edited to avoid filling after end of series.
library(dplyr)
dfr <- data.frame(
index = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
message = c(NA, "foo", "foo", NA, "foo", NA, NA, "baz", NA, "baz", "baz", "baz", "bar", NA, NA, "bar", NA, "bar", NA, "qux"),
id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
)
dfrFilled <- dfr %>%
group_by(id) %>%
mutate(
endSeries = max( # identify end of series
index[message == na.omit(message)[1]],
na.rm = TRUE
),
filledValues = case_when(
min(index) == index ~ message,
max(index) == index ~ message,
index < endSeries ~ na.omit(message)[1], # fill if index is before end of series.
TRUE ~ message
)
)
If you fill both ways and check for equality, that should work, as long as you account for grouping and index:
library(tidyverse)
dat %>%
arrange(id, index) %>%
mutate(msg_down = fill(group_by(., id), message, .direction = 'down')$message,
msg_up = fill(group_by(., id), message, .direction = 'up')$message,
message = case_when(!is.na(message) ~ message,
msg_down == msg_up ~ msg_down,
TRUE ~ NA_character_)) %>%
select(-msg_down, -msg_up)
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
library(data.table)
library(zoo)
setDT(dat)[order(index),
message := ifelse(na.locf(message, na.rm = FALSE) == na.locf(message, na.rm = FALSE, fromLast = TRUE),
na.locf(message, na.rm = FALSE),
NA),
by = "id"][]
id message index
1: 1 <NA> 1
2: 1 foo 2
3: 1 foo 3
4: 1 foo 4
5: 1 foo 5
6: 1 <NA> 6
7: 2 <NA> 1
8: 2 baz 2
9: 2 baz 3
10: 2 baz 4
11: 2 baz 5
12: 2 baz 6
13: 3 bar 1
14: 3 bar 2
15: 3 bar 3
16: 3 bar 4
17: 3 bar 5
18: 3 bar 6
19: 3 <NA> 7
20: 3 qux 8