Replace NA with previous or next value, by group, using dplyr
I have a data frame that is arranged in descending order of date.
ps1 = data.frame(userID = c(21,21,21,22,22,22,23,23,23),
color = c(NA,'blue','red','blue',NA,NA,'red',NA,'gold'),
age = c('3yrs','2yrs',NA,NA,'3yrs',NA,NA,'4yrs',NA),
gender = c('F',NA,'M',NA,NA,'F','F',NA,'F')
)
I wish to impute (replace) NA values with the previous value, grouped by userID. If the first row of a userID is NA, it should be replaced with the next available value within that userID group.
I am trying to use the dplyr and zoo packages, something like this, but it's not working:
cleanedFUG <- filteredUserGroup %>%
group_by(UserID) %>%
mutate(Age1 = na.locf(Age),
Color1 = na.locf(Color),
Gender1 = na.locf(Gender) )
I need a result data frame like this:
userID color age gender
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
library(dplyr) # group_by() and the %>% pipe
library(tidyr) # fill() is part of tidyr
ps1 %>%
  group_by(userID) %>%
  # fill(color, age, gender) %>% # default direction is "down"
  fill(color, age, gender, .direction = "downup")
Which gives you:
Source: local data frame [9 x 4]
Groups: userID [3]
userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
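To make the choice of direction concrete, here is a minimal sketch on a toy data frame (the names demo, g and v are made up for illustration). It assumes a tidyr version that accepts "downup" as a .direction value. With "down" a leading NA has nothing above it to copy, so it stays NA; "downup" adds a second, upward pass that picks up the next value.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-group example with a leading and a trailing NA
demo <- data.frame(g = c(1, 1, 1), v = c(NA, "a", NA), stringsAsFactors = FALSE)

# Fill downwards only: the leading NA is left untouched
down_only <- demo %>%
  group_by(g) %>%
  fill(v, .direction = "down") %>%
  ungroup()

# Fill downwards first, then upwards for whatever is still missing
down_up <- demo %>%
  group_by(g) %>%
  fill(v, .direction = "downup") %>%
  ungroup()
```

Here down_only$v keeps its first element as NA, while every element of down_up$v is "a".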
Using zoo::na.locf directly on the whole data.frame would fill the NAs regardless of the userID groups. Unfortunately, dplyr's grouping has no effect on the na.locf function, which is why I went with a split:
library(dplyr); library(zoo)
ps1 %>% split(ps1$userID) %>%                                      # one data.frame per userID
  lapply(function(x) {na.locf(na.locf(x), fromLast = TRUE)}) %>%   # fill down, then up
  do.call(rbind, .)                                                # stack the pieces back together
#### userID color age gender
#### 21.1 21 blue 3yrs F
#### 21.2 21 blue 2yrs F
#### 21.3 21 red 2yrs M
#### 22.4 22 blue 3yrs F
#### 22.5 22 blue 3yrs F
#### 22.6 22 blue 3yrs F
#### 23.7 23 red 4yrs F
#### 23.8 23 red 4yrs F
#### 23.9 23 gold 4yrs F
It first splits the data into 3 data.frames. The anonymous function in lapply then applies a first pass of imputation (downwards) followed by a second pass (upwards), and finally rbind brings the data.frames back together. You have the expected output.
I wrote this function, and it is definitely faster than fill and probably faster than na.locf:
fill_NA <- function(x) {
  # Positions of the non-NA values, plus a sentinel one past the end
  which.na <- c(which(!is.na(x)), length(x) + 1)
  values <- na.omit(x)
  # If x starts with NA, let the first non-NA value cover the leading gap
  if (which.na[1] != 1) {
    which.na <- c(1, which.na)
    values <- c(values[1], values)
  }
  # Each value is repeated until the next non-NA position
  diffs <- diff(which.na)
  return(rep(values, times = diffs))
}
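The answer does not show how the function is meant to be applied per group. One way it might be used is inside a grouped mutate(), as sketched below. This assumes dplyr 1.0 or later for across(), and it rebuilds ps1 with stringsAsFactors = FALSE so the columns are plain character vectors; both are assumptions of this sketch rather than part of the original answer.

```r
library(dplyr)

# Same function as above, repeated so the example is self-contained
fill_NA <- function(x) {
  which.na <- c(which(!is.na(x)), length(x) + 1)
  values <- na.omit(x)
  if (which.na[1] != 1) {
    which.na <- c(1, which.na)
    values <- c(values[1], values)
  }
  diffs <- diff(which.na)
  return(rep(values, times = diffs))
}

# Data from the question, as character columns (assumption: no factors)
ps1 <- data.frame(
  userID = c(21, 21, 21, 22, 22, 22, 23, 23, 23),
  color  = c(NA, "blue", "red", "blue", NA, NA, "red", NA, "gold"),
  age    = c("3yrs", "2yrs", NA, NA, "3yrs", NA, NA, "4yrs", NA),
  gender = c("F", NA, "M", NA, NA, "F", "F", NA, "F"),
  stringsAsFactors = FALSE
)

# Apply fill_NA to each column separately within each userID group
filled <- ps1 %>%
  group_by(userID) %>%
  mutate(across(c(color, age, gender), fill_NA)) %>%
  ungroup()
```

Because mutate() runs the function once per group and per column, a leading NA is only ever replaced by a value from the same userID.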
Using @agenis' method with na.locf() combined with purrr, you could do:
library(purrr)
library(zoo)
ps1 %>%
slice_rows("userID") %>%
by_slice(function(x) {
na.locf(na.locf(x), fromLast=T) },
.collate = "rows")
A few years down the line, I found that things have changed. Using @Steven Beaupré's approach:
1) Adding na.rm=F ensures no rows are deleted/excluded.
2) The slice_rows() function can be found in the purrrlyr package.
library(purrrlyr)
library(zoo)
ps1 %>%
slice_rows("userID") %>%
by_slice(function(x) {
na.locf(na.locf(x, na.rm=F), fromLast=T, na.rm=F) },
.collate = "rows")
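The same na.rm = FALSE idea can also be tried without purrrlyr, by calling na.locf() on each column inside a grouped mutate(), where each call only sees the current group's values. The following is a sketch under a few assumptions: dplyr 1.0 or later for across(), ps1 rebuilt with character columns, and fill_down_up as a hypothetical helper name introduced here for readability.

```r
library(dplyr)
library(zoo)

# Data from the question, as character columns (assumption: no factors)
ps1 <- data.frame(
  userID = c(21, 21, 21, 22, 22, 22, 23, 23, 23),
  color  = c(NA, "blue", "red", "blue", NA, NA, "red", NA, "gold"),
  age    = c("3yrs", "2yrs", NA, NA, "3yrs", NA, NA, "4yrs", NA),
  gender = c("F", NA, "M", NA, NA, "F", "F", NA, "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last observation forward, then the next one
# backward; na.rm = FALSE keeps the vector length unchanged in both passes
fill_down_up <- function(x) {
  na.locf(na.locf(x, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}

locf_filled <- ps1 %>%
  group_by(userID) %>%
  mutate(across(c(color, age, gender), fill_down_up)) %>%
  ungroup()
```

Keeping the length unchanged matters here because mutate() expects each result to have as many elements as the group has rows.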