
R, dplyr: How to replace 0 values conditionally on the size of the group_by group

I am trying to replace the 0 values in a column with the median of their group, but only when the group produced by group_by is below a certain size. The real data set is large.

library(dplyr)

set.seed(10000)
Data <- data.frame(
    X = as.numeric(c(0,2,3,4,5,6,7,8,9,0)),
    Y = c("no","yes","yes","yes","yes","yes","yes","yes","yes","yes"),
    Z = c(F,T,T,T,T,F,F,F,T,T)
)

# change the 0 in row 10 to the group median
Data <- Data %>%
    # group by Y and Z, then
    group_by(Y,Z) %>%
    # if the group has fewer than 2 rows and X is 0, replace X with the group median;
    # otherwise (group size 2 or greater) leave X unchanged
    mutate(X = ifelse(n()<2,ifelse(X==0,median(X),X),X))

# change the 0 in row 1 to the group median
Data <- Data %>%
    # group by Y, then
    group_by(Y) %>%
    # if the group has fewer than 3 rows and X is 0, replace X with the group median;
    # otherwise (group size 3 or larger) leave X unchanged
    mutate(X = ifelse(n()<3,ifelse(X==0,median(X),X),X))

This results in the error:

Error in n > 1 :

comparison (6) is possible only for atomic and list types

I expect column X to be the sequence 1:10 after running the code above.

This is a simplified version of a problem I have with a large data set, where I am trying to impute 0 values with the median of different groupings, conditional on the size of the group, and I get the same error as above.

See if this works for you:

library(tidyverse)

set.seed(10000)
Data <- data.frame(
  X = c(NA,2,3,4,5,6,7,8,9,NA),
  Y = c("yes","yes","yes","yes","yes","yes","yes","yes","yes","no"),
  Z = c(T,F,F,F,F,F,F,F,F,T)
)

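# Step 1: the "no" group has a single row, so its NA (row 10) becomes 10.
# Rows in larger groups keep their X value.
Data <- Data %>%
  group_by(Y) %>%
  mutate(count = n()) %>%
  mutate(X = ifelse(count < 2, ifelse(is.na(X), 10, X), X)) %>%
  select(-count) %>%
  ungroup()

# Step 2: the (yes, TRUE) group has a single row, so its NA (row 1) becomes 1.
Data <- Data %>%
  group_by(Y, Z) %>%
  mutate(count = n()) %>%
  mutate(X = ifelse(count < 3, ifelse(is.na(X), 1, X), X)) %>%
  select(-count) %>%
  ungroup()

# You can bypass the count column. n() is a single number per group,
# so branch with if/else here rather than ifelse().
Data %>%
  group_by(Y) %>%
  mutate(X = if (n() < 2) ifelse(is.na(X), 10, X) else X)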
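For the zero-valued data in the question (rather than the NA version used here), something along the following lines might work. It is only a sketch: the data frame `df`, the threshold `max_size`, and the choice to take the median over the non-zero values are all assumptions made for illustration. The point it shows is that `n()` returns one number per group, so `ifelse(n() < 3, ...)` returns a single value that gets recycled over the whole group, whereas `if`/`else` picks a whole branch. As a side note, the reported `Error in n > 1` is the message R gives when a function such as `n` is compared without being called, so it may be worth checking the real code for a missing `()`.

```r
library(dplyr)

# Hypothetical example data: group "a" is small (2 rows), group "b" is larger (4 rows).
df <- data.frame(
  g = c("a", "a", "b", "b", "b", "b"),
  x = c(0, 4, 0, 1, 2, 3)
)

max_size <- 3  # assumed threshold: only groups with fewer rows are imputed

result <- df %>%
  group_by(g) %>%
  mutate(
    # n() is one number per group, so choose the branch with if/else;
    # the inner ifelse() is vectorised over the rows of the group.
    x = if (n() < max_size) ifelse(x == 0, median(x[x != 0]), x) else x
  ) %>%
  ungroup()

result$x
# 4 4 0 1 2 3
```

The zero in the small group "a" is replaced by 4, while the zero in the larger group "b" is left alone.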

I couldn't work out an answer to your exact question, but I hope this points you in the right direction (it is also a data.table solution).

Assuming you want to replace each NA with the mean of the column, depending on the size of the group, there is a function in the zoo package that can help:

# load libraries

library(zoo)
library(data.table)

# convert Data to a data.table

setDT(Data)

Now we'll use the function zoo::na.aggregate to replace any NA with the mean. But we need to introduce the size of the group as a condition, so I'll go step by step first:

# create a column with the number of elements in the group. It'll be removed later:

Data[, n:= .N, by = Y]

# Create a new column newX. In groups larger than 2, NAs are replaced by the group mean.
# In groups of size 2 or less, every value is set to an arbitrary number (I chose 100):

Data[, newX := ifelse(n > 2, na.aggregate(X), 100), by = Y]

# Now you can optionally copy newX to X:

Data[, X := newX]

# and delete n and newX:

Data[, c("n", "newX") := NULL]

Of course, you could skip the X := newX step by assigning directly to X, but I considered that a bit more obscure than the step-by-step process.
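For reference, the direct assignment could look roughly like this. It is a sketch on a small made-up table (`dt` and its values are assumptions, not the data from the question), and it uses `.N`, data.table's built-in group size, instead of a helper column:

```r
library(data.table)
library(zoo)

# Hypothetical example data: "yes" has 4 rows, "no" has 1 row.
dt <- data.table(
  X = c(NA, 2, 3, 4, NA),
  Y = c("yes", "yes", "yes", "yes", "no")
)

# .N is the number of rows in the current group, so if/else picks one branch per group:
# groups larger than 2 get their NAs replaced by the group mean,
# smaller groups are set to the arbitrary value 100.
dt[, X := if (.N > 2) na.aggregate(X) else 100, by = Y]

dt$X
# 3 2 3 4 100
```

This keeps the same behaviour as the step-by-step version, including overwriting every value in a small group with 100.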
