R, dplyr: How to replace 0 values based on the size of the group_by

Question

For a large data set, I am trying to replace the 0 values in a column with the median of their group, conditional on the size of that group (as defined by group_by).

library(dplyr)

set.seed(10000)
Data <- data.frame(
    X = as.numeric(c(0,2,3,4,5,6,7,8,9,0)),
    Y = c("no","yes","yes","yes","yes","yes","yes","yes","yes","yes"),
    Z = c(FALSE,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE)
)

# first pass: change 0 to the group median
Data <- Data %>%
    # group by Y and Z then
    group_by(Y,Z) %>%
    # intent: if the group has fewer than 2 rows and X is 0, change X to the
    # group median; otherwise leave X as it is
    mutate(X = ifelse(n()<2,ifelse(X==0,median(X),X),X))

# second pass: change 0 to the group median
Data <- Data %>%
    # group by Y then
    group_by(Y) %>%
    # intent: if the group has fewer than 3 rows and X is 0, change X to the
    # group median; otherwise (group size 3 or larger) leave X as it is
    mutate(X = ifelse(n()<3,ifelse(X==0,median(X),X),X))

This results in the error:

Error in n > 1 : 
  comparison (6) is possible only for atomic and list types

I expect column X to be the sequence 1:10 after running the code above.

This is a simplified version of a problem I am having with a large data set, where I am trying to impute 0 values as the median of different groupings, conditional on the size of the group, and I get the same error as above.

Answer 1

See if this works for you:

library(tidyverse)

set.seed(10000)
Data <- data.frame(
  X = c(NA,2,3,4,5,6,7,8,9,NA),
  Y = c("yes","yes","yes","yes","yes","yes","yes","yes","yes","no"),
  Z = c(T,F,F,F,F,F,F,F,F,T)
)

# groups of Y with fewer than 2 rows: change NA to 10 (row 10 here);
# all other values are left as they are
Data %>%
  group_by(Y) %>%
  mutate(count = n()) %>%
  mutate(X = ifelse(count < 2, ifelse(is.na(X), 10, X), X)) %>%
  select(-count)


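# groups of Y and Z with fewer than 3 rows: change NA to 1
# (assign the previous result back to Data first if the steps should build on each other)
Data %>%
  group_by(Y,Z) %>%
  mutate(count = n()) %>%
  mutate(X = ifelse(count < 3, ifelse(is.na(X), 1, X), X)) %>%
  select(-count)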



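# You can bypass the count column. n() is a single number per group, so combine
# it with the row-wise test instead of passing it to ifelse() on its own
Data %>%
  group_by(Y) %>%
  mutate(X = ifelse(n() < 2 & is.na(X), 10, X))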
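
The data in the question uses 0 rather than NA as the placeholder, so here is a minimal sketch of the same idea for that case. The threshold `min_size`, the direction of the size test (groups with at least that many rows), and taking the median over the non-zero values only are my assumptions, not something stated in the question.

```r
library(dplyr)

# Data as given in the question
Data <- data.frame(
  X = c(0, 2, 3, 4, 5, 6, 7, 8, 9, 0),
  Y = c("no", rep("yes", 9)),
  Z = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

min_size <- 3  # hypothetical threshold, chosen only for illustration

Result <- Data %>%
  group_by(Y, Z) %>%
  # n() is one number per group; "&" recycles it against the row-wise test,
  # so ifelse() still returns one value per row
  mutate(X = ifelse(n() >= min_size & X == 0, median(X[X != 0]), X)) %>%
  ungroup()

Result
```

With this threshold the 0 in row 10 is replaced by the median of the other values in its (Y, Z) group, while the 0 in row 1 stays as it is because its group has a single row.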

Answer 2

I couldn't work out an answer to your exact question, but I hope this points you in the right direction (note that it is a data.table solution).

Assuming you want to replace any NA with the mean of the column, depending on the size of the group, there's a function in the zoo package that can help:

# load libraries

library(zoo)
library(data.table)

# convert Data to a data.table

setDT(Data)

Now we'll use the function zoo::na.aggregate to replace any NA with the mean. But we need to introduce the size of the group as a condition, so I'll go step by step first:

# create a column with the number of elements in the group. It'll be removed later:

Data[, n:= .N, by = Y]

# Create a new column newX: in groups with more than 2 rows, NAs are replaced by the group mean; in groups with 2 rows or fewer, NAs are replaced by an arbitrary number (I chose 100):

Data[, newX := ifelse(n > 2, na.aggregate(X), ifelse(is.na(X), 100, X)), by = Y]

# Now you can optionally copy newX to X:

Data[, X := newX]

# and delete n and newX:

Data[, c("n", "newX") := NULL]

Of course, you could skip the X := newX step by assigning directly to X, but I considered that a bit more obscure than the step-by-step process.
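
For reference, a rough sketch of that direct assignment might look like the following. It rebuilds the NA-based data from Answer 1 so it runs on its own, and it uses .N inside the grouped call instead of a helper column; the fill value 100 is the same arbitrary choice as above, and applying it only to NAs is my reading of the intent.

```r
library(data.table)
library(zoo)

# Same NA-based example data as in Answer 1 (only X and Y are needed here)
Data <- data.table(
  X = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, NA),
  Y = c(rep("yes", 9), "no")
)

# .N is the group size: groups with more than 2 rows get NAs replaced by the
# group mean, smaller groups get NAs replaced by the arbitrary value 100
Data[, X := if (.N > 2) na.aggregate(X) else ifelse(is.na(X), 100, X), by = Y]

Data
```

Because .N is a single number per group, a plain if/else is enough here and no temporary n or newX column has to be created or removed.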
