
How to conditionally replace values with NA across multiple columns

I would like to replace outliers in each column of a dataframe with NA.

If, for example, we define outliers as any value more than 3 standard deviations from the mean, I can achieve this for one variable at a time with the code below.

Rather than specifying each column individually, I'd like to perform the same operation on all columns of df in one call. Any pointers on how to do this?

Thanks!

library(dplyr)
data("iris")
df <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  head(10)

# add a clear outlier to each variable
df[1, 1:3] = 99

# replace values above 3 SDs with NA
df_cleaned <- df %>%
  mutate(Sepal.Length = replace(Sepal.Length,
                                Sepal.Length > 3 * sd(Sepal.Length, na.rm = TRUE),
                                NA))

You need to use mutate_all(), i.e.

library(dplyr)

df %>%
  # funs() is deprecated since dplyr 0.8.0; a formula lambda does the same job
  mutate_all(~ replace(., . > 3 * sd(., na.rm = TRUE), NA))
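
In dplyr 1.0.0 and later, mutate_all() is superseded by across(). A minimal sketch of the same replacement in that style, assuming a recent dplyr and reusing the question's example data and its threshold (raw value greater than 3 * sd):

```r
library(dplyr)

# rebuild the example data from the question
df <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  head(10)
df[1, 1:3] <- 99

# apply the same replacement to every column in one call
df_cleaned <- df %>%
  mutate(across(everything(), ~ replace(.x, .x > 3 * sd(.x, na.rm = TRUE), NA)))
```

To restrict the operation to numeric columns in a mixed data frame, everything() could be swapped for where(is.numeric).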

Another option is base R

df[] <- lapply(df, function(x) replace(x, x > 3 * sd(x, na.rm = TRUE), NA))

or with colSds from matrixStats

library(matrixStats)
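# sweep() lines each threshold up with its own column; comparing df against the bare vector would recycle it down the rows
m <- as.matrix(df)
df[sweep(m, 2, 3 * colSds(m, na.rm = TRUE), ">")] <- NA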
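
Note that the snippets above follow the question's code and compare the raw values with 3 * sd, not the distance from the mean. A minimal sketch of the mean-based definition stated in the question, in base R; the helper name is_outlier and the df_full objects are made up for illustration. It uses all 150 rows of iris because, in a sample of n values, no point can lie more than (n - 1) / sqrt(n) SDs from the mean, which is about 2.85 for n = 10, so nothing in the 10-row example would be flagged at 3 SDs.

```r
# hypothetical helper: TRUE where a value lies more than k SDs from the column mean
is_outlier <- function(x, k = 3) {
  abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE)
}

# use the full data so a 3-SD deviation is actually attainable
df_full <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
df_full[1, ] <- 99  # plant one clear outlier per column

# replace flagged values with NA in every column
df_full_cleaned <- df_full
df_full_cleaned[] <- lapply(df_full, function(x) replace(x, is_outlier(x), NA))
```

Because the planted value inflates both the mean and the SD, a median/MAD-based rule may be more robust; that is a design choice outside what the question asks for.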
