
How to detect and replace outliers in multiple columns of a single data set using R?

I am trying to find and replace outliers from multiple numeric columns. This is not the best practice in my humble opinion, but it is something I'm attempting to figure out for specific use cases. A great example of creating an additional column that labels a row as an outlier can be found here but it is based on a single column.

My data looks as follows (for simplicity, I excluded columns with factors):

   RowID   Value1   Value2
       1        6        1
       2        2      200
       3      100        3
       4        1        4
       5      250        5
       6        2        6
       7        8      300
       8      600      300
       9        2        9

I used a function to replace outliers with NA in all numeric columns:

replaceOuts = function(df) {
    map_if(df, is.numeric, 
           ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>% 
    bind_cols 
}
test = replaceOuts(df)

My question is: how can I replace the outliers with another value (e.g., mean, median, capped value, etc.)? Any help would be appreciated!

Instead of NA, you could replace the outliers with the mean, the median, or whatever value you prefer.

library(dplyr)
library(purrr)

replaceOuts = function(df) {
   map_if(df, is.numeric, 
          ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))) %>%
   bind_cols 
}

replaceOuts(df)

# RowID Value1 Value2
#  <dbl>  <dbl>  <dbl>
#1     1     6       1
#2     2     2     200
#3     3   100       3
#4     4     1       4
#5     5   108.      5
#6     6     2       6
#7     7     8     300
#8     8   108.    300
#9     9     2       9

Replace mean with median or any other function that you want.
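One thing to keep in mind: mean(.x) above is computed over the whole column, outliers included, which is why the replacement value is about 108. A minimal base R sketch of a variation follows, where the replacement is computed from the non-outlying values only and the summary function is passed in as an argument. The names replace_outliers, fun, and df_clean are made up for this illustration, and the sketch assumes the columns contain no NA values.

```r
# Sample data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Hypothetical helper: replace boxplot outliers in a numeric vector with
# fun() evaluated on the non-outlying values only (assumes no NAs in x).
replace_outliers <- function(x, fun = median) {
  is_out <- x %in% boxplot.stats(x)$out
  replace(x, is_out, fun(x[!is_out]))
}

# Apply the helper to every numeric column, leaving other columns untouched
num_cols <- vapply(df, is.numeric, logical(1))
df_clean <- df
df_clean[num_cols] <- lapply(df[num_cols], replace_outliers, fun = median)
```

With the sample data, the two outliers in Value1 (250 and 600) are replaced by 2, the median of the remaining values. Pass fun = mean to use the mean of the non-outliers instead.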

PS - I think it is better to use mutate_if instead of map_if here since it avoids bind_cols at the end.

df %>% mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, mean(.)))
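As of dplyr 1.0.0, mutate_if is superseded by across(). A sketch of the same replacement written that way, assuming dplyr 1.0.0 or later is installed and using the sample data from the question (df_clean is just a name chosen for this example):

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Replace boxplot outliers in every numeric column with that column's mean
df_clean <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))))
```

This should give the same result as the mutate_if version above.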

I think you need minVal and maxVal threshold values. Then replace any value outside the range (minVal, maxVal) with myValue (the mean, the median, or whatever you need).

# The limits could be any values, e.g. the boxplot whisker ends:
minVal <- boxplot.stats(data$columnX)$stats[1]
maxVal <- boxplot.stats(data$columnX)$stats[5]
myValue <- median(data$columnX)

data[data$columnX < minVal | data$columnX > maxVal, "columnX"] <- myValue   
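The same thresholds can also be used to cap values instead of replacing them with a single number, and the idea extends to several columns at once. This is only a sketch under those assumptions, in base R, using the sample data from the question. The names cap_outliers and df_capped are hypothetical.

```r
# Sample data from the question
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Hypothetical helper: cap values that fall beyond the boxplot whiskers
# at the whisker ends (stats[1] is the lower whisker, stats[5] the upper).
cap_outliers <- function(x) {
  s <- boxplot.stats(x)$stats
  pmin(pmax(x, s[1]), s[5])
}

# Apply the helper to every numeric column
num_cols <- vapply(df, is.numeric, logical(1))
df_capped <- df
df_capped[num_cols] <- lapply(df[num_cols], cap_outliers)
```

With the sample data, 250 and 600 in Value1 are capped at 100, the largest value that is not flagged as an outlier. Value2 has no outliers by this rule, so it is unchanged.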
