R, dplyr: How to replace 0 values conditionally on the size of the group_by group
I am trying to replace the 0 values in a column with the median value of their group, conditional on the size of the group_by group, for a large data set.
library(dplyr)
set.seed(10000)
Data <- data.frame(
X = as.numeric(c(0,2,3,4,5,6,7,8,9,0)),
Y = c("no","yes","yes","yes","yes","yes","yes","yes","yes","yes"),
Z = c(F,T,T,T,T,F,F,F,T,T)
)
# change 0 in the 10 spot to median
Data <- Data %>%
# group by Y and Z then
group_by(Y,Z) %>%
# if the group has fewer than 2 rows and X is 0, change X to the group median;
# otherwise leave X unchanged
mutate(X = ifelse(n()<2,ifelse(X==0,median(X),X),X))
# change 0 in 1 spot to median
Data <- Data %>%
# group by Y then
group_by(Y) %>%
# if the group has fewer than 3 rows and X is 0, change X to the group median;
# otherwise leave X unchanged
mutate(X = ifelse(n()<3,ifelse(X==0,median(X),X),X))
Resulting in error:
Error in n > 1 :
comparison (6) is possible only for atomic and list types
I am expecting column X to be the sequence 1:10 after the above code.
This is a generalization of a problem I am having with a large data set, where I am trying to impute 0 values as the median of different groupings conditional on the size of the group, and I am getting the same error as above.
See if this works for you:
library(tidyverse)
set.seed(10000)
Data <- data.frame(
X = c(NA,2,3,4,5,6,7,8,9,NA),
Y = c("yes","yes","yes","yes","yes","yes","yes","yes","yes","no"),
Z = c(T,F,F,F,F,F,F,F,F,T)
)
# change NA in the 10 spot to 10
Data <- Data %>%
group_by(Y) %>%
mutate(count = n()) %>%
# only groups with fewer than 2 rows are touched; all other rows keep X
mutate(X = ifelse(count < 2, ifelse(is.na(X), 10, X), X)) %>%
select(-count)
# change NA in 1 spot to 1
Data <- Data %>%
group_by(Y,Z) %>%
mutate(count = n()) %>%
mutate(X = ifelse(count < 3, ifelse(is.na(X), 1, X), X)) %>%
select(-count)
# You can bypass the count column.
# n() is a single value per group, so use if/else here:
# ifelse() with a length-1 test would return only one element
Data %>%
group_by(Y) %>%
mutate(X = if (n() < 2) ifelse(is.na(X), 10, X) else X)
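For the original question (zeros rather than NAs, and the group median as the replacement), a minimal dplyr sketch might look like the following. It assumes zeros should be replaced only in groups with at least `min_size` rows, and that the median should be taken over the non-zero values of the group. Both the name `min_size` and that reading of the requirement are assumptions, not something stated in the question.

```r
library(dplyr)

# Data from the question
Data <- data.frame(
  X = c(0, 2, 3, 4, 5, 6, 7, 8, 9, 0),
  Y = c("no", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes"),
  Z = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Hypothetical threshold: only groups with at least this many rows are imputed
min_size <- 3

Result <- Data %>%
  group_by(Y) %>%
  mutate(
    # n() is one value per group, so a plain if/else is enough here.
    # If a large group contained only zeros, the median would be NA.
    X = if (n() >= min_size) {
      replace(X, X == 0, median(X[X != 0]))
    } else {
      X
    }
  ) %>%
  ungroup()
```

With this data the zero in row 10 (group "yes", 9 rows) becomes the median of 2 to 9, while the zero in row 1 (group "no", 1 row) is left alone because its group is below the threshold.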
I can't find how to answer your exact question, but I hope this sets you in the right direction (it's also a data.table solution).
Assuming you want the mean of the column instead of any NA, depending on the size of the group, there's a function from the zoo package that can be of help:
# load libraries
library(zoo)
library(data.table)
# convert Data to a data.table
setDT(Data)
Now, we'll use the function zoo::na.aggregate to replace any NA with the mean. But we need to introduce the size of the group as a condition. So I'll go step-by-step first:
# create a column with the number of elements in the group. It'll be removed later:
Data[, n:= .N, by = Y]
# Create a new X column with the NAs replaced by the mean if the group is larger than 2, or with an arbitrary number (I chose 100) if the group size is less than or equal to 2:
Data[, newX := ifelse(n > 2, na.aggregate(X), 100), by = Y]
# Now you can optionally copy newX to X:
Data[, X := newX]
# and delete n and newX:
Data[, c("n", "newX") := NULL]
Of course you could have skipped the X := newX part by assigning directly to X, but I considered that a bit more obscure than the step-by-step process.
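The direct assignment mentioned above could be sketched roughly as follows. This is an illustration rather than the answerer's own code: it uses data.table's built-in `.N` for the group size instead of a helper column, and, like the step-by-step version, it gives every row of a small group the arbitrary value 100.

```r
library(data.table)
library(zoo)

# Same NA-based example data as in the first answer
Data <- data.table(
  X = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, NA),
  Y = c("yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "no")
)

# .N is the number of rows in the current group.
# Groups larger than 2: NAs are replaced by the group mean (na.aggregate default).
# Smaller groups: every value is set to the arbitrary placeholder 100.
Data[, X := if (.N > 2) na.aggregate(X) else 100, by = Y]
```

Because `.N` is a single value per group, a plain `if`/`else` is sufficient and no temporary `n` or `newX` column has to be created and deleted.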