
R, dplyr: How to replace 0 values conditionally on the size of the group_by group

In a large data set, I am trying to replace the 0 values in a column with the median of their group, conditional on the size of that group (as defined by group_by).

library(dplyr)

set.seed(10000)
Data <- data.frame(
    X = as.numeric(c(0,2,3,4,5,6,7,8,9,0)),
    Y = c("no","yes","yes","yes","yes","yes","yes","yes","yes","yes"),
    Z = c(F,T,T,T,T,F,F,F,T,T)
)

# change 0 in the 10 spot to median
Data <- Data %>%
    # group by Y and Z then
    group_by(Y,Z) %>%
    # if the group has fewer than 2 rows and X is 0, change it to the group median;
    # otherwise (X is not 0, or the group has 2 or more rows) leave X as it is
    mutate(X = ifelse(n()<2,ifelse(X==0,median(X),X),X)) 

# change 0 in 1 spot to median
Data <- Data %>%
    # group by Y then
    group_by(Y) %>%
    # if the group has fewer than 3 rows and X is 0, change it to the group median;
    # otherwise (X is not 0, or the group has 3 or more rows) leave X as it is
    mutate(X = ifelse(n()<3,ifelse(X==0,median(X),X),X))

Resulting in error:

Error in n > 1 : comparison (6) is possible only for atomic and list types

I expect column X to be the sequence 1:10 after running the code above.

This is a simplified version of a problem I have with a large data set, where I am trying to impute 0 values with the median of different groupings, conditional on the size of the group. There I get the same error as above.

See if this works for you:

library(tidyverse)

set.seed(10000)
Data <- data.frame(
  X = c(NA,2,3,4,5,6,7,8,9,NA),
  Y = c("yes","yes","yes","yes","yes","yes","yes","yes","yes","no"),
  Z = c(T,F,F,F,F,F,F,F,F,T)
)

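# change NA in the 10 spot to 10
Data <- Data %>%
  group_by(Y) %>%
  mutate(count = n()) %>%
  # only groups with fewer than 2 rows are imputed; every other row keeps its X
  mutate(X = ifelse(count < 2, ifelse(is.na(X), 10, X), X)) %>%
  ungroup() %>%
  select(-count)


# change NA in 1 spot to 1
Data <- Data %>%
  group_by(Y, Z) %>%
  mutate(count = n()) %>%
  # only groups with fewer than 3 rows are imputed; every other row keeps its X
  mutate(X = ifelse(count < 3, ifelse(is.na(X), 1, X), X)) %>%
  ungroup() %>%
  select(-count)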



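# You can bypass the count column. n() is a single number per group, so use
# if/else for the size check: ifelse() with a length-one test returns one value only
Data %>%
  group_by(Y) %>%
  mutate(X = if (n() < 2) ifelse(is.na(X), 10, X) else X)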
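
The same pattern can be pointed at the data from the question, where the values to impute are 0 rather than NA and the replacement is the group median. What follows is only a sketch under assumptions that are mine, not the asker's: the threshold name min_size is hypothetical, imputation happens only in groups with at least that many rows, and the median is taken over all values in the group, zeros included, as in the question's own code. The size check uses if/else rather than ifelse() because n() is a single number per group, and ifelse() with a length-one test would hand back only the first element of X.

```r
library(dplyr)

# Data from the question: zeros mark the values to impute.
Data <- data.frame(
  X = c(0, 2, 3, 4, 5, 6, 7, 8, 9, 0),
  Y = c("no", rep("yes", 9)),
  Z = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Hypothetical threshold (an assumption): impute only in groups this large.
min_size <- 3

result <- Data %>%
  group_by(Y, Z) %>%
  # if/else picks one branch per group; the inner ifelse() works row by row
  mutate(X = if (n() >= min_size) ifelse(X == 0, median(X), X) else X) %>%
  ungroup()

result$X
# 0.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 3.5
```

Row 10 sits in the (yes, TRUE) group of six rows, whose median is 3.5. Row 1 is alone in its group, so it stays 0 under this rule. A different rule for small groups would go in the else branch.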

I couldn't work out how to answer your exact question, but I hope this points you in the right direction. Note that it is a data.table solution rather than a dplyr one.

Assuming you want to replace each NA with the mean of the column, depending on the size of the group, there is a function in the zoo package that can help:

# load libraries

library(zoo)
library(data.table)

# convert Data to a data.table

setDT(Data)

Now we'll use the function zoo::na.aggregate to replace any NA with the mean. We also need to bring in the size of the group as a condition, so I'll go step by step first:

# create a column with the number of elements in the group. It'll be removed later:

Data[, n:= .N, by = Y]

# Create a new X column with the NAs replaced by the group mean if the group has more than 2 elements, or by an arbitrary number (I chose 100) if the group has 2 or fewer:

Data[, newX := ifelse(n > 2, na.aggregate(X), ifelse(is.na(X), 100, X)), by = Y]

# Now you can optionally copy newX to X:

Data[, X := newX]

# and delete n and newX:

Data[, c("n", "newX") := NULL]

Of course, you could have skipped the X := newX part by assigning directly to X, but I considered that a bit more obscure than the step-by-step process.
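
For comparison, the direct assignment could look roughly like the sketch below. It is an illustration rather than a tested drop-in: it rebuilds the example data with NAs, uses data.table's .N for the group size instead of a helper column, and keeps the arbitrary fallback of 100 for small groups.

```r
library(data.table)
library(zoo)

# Example data with NAs in rows 1 and 10
Data <- data.table(
  X = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, NA),
  Y = c(rep("yes", 9), "no"),
  Z = c(TRUE, rep(FALSE, 8), TRUE)
)

# .N is the number of rows in the current group, so if/else picks one branch
# per group: the group mean for NAs in large groups, 100 for NAs in small ones.
Data[, X := if (.N > 2) na.aggregate(X) else ifelse(is.na(X), 100, X), by = Y]

Data$X
# 5.5 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 100.0
```

The "yes" group has nine rows, so its NA becomes the mean of 2 to 9, which is 5.5. The "no" group has a single row, so its NA becomes 100.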
