How to detect and replace outliers from multiple columns in a single data set using R?

I am trying to find and replace outliers in multiple numeric columns. In my humble opinion this is not best practice, but it is something I am trying to figure out for specific use cases. A great example of creating an additional column that labels a row as an outlier can be found here, but it is based on a single column.

My data looks as follows (for simplicity, I excluded the factor columns):

   Row ID   Value1   Value2
      1        6        1
      2        2      200
      3      100        3
      4        1        4
      5      250        5
      6        2        6
      7        8      300
      8      600      300
      9        2        9

I used a function to replace outliers with NA in all numeric columns:

replaceOuts = function(df) {
    map_if(df, is.numeric, 
           ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>% 
    bind_cols 
}
test = replaceOuts(df)
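
For reference, `boxplot.stats()` flags values that lie more than 1.5 times the hinge spread beyond the hinges (the default `coef = 1.5`). A minimal sketch of what it reports for the sample data, assuming the data frame is called `df` and the ID column is named `RowID` (both names are assumptions for illustration):

```r
# Sample data from the question; the names df and RowID are assumed.
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# $out holds the values that fall beyond the whiskers.
out1 <- boxplot.stats(df$Value1)$out
out2 <- boxplot.stats(df$Value2)$out

print(out1)  # the two flagged values in Value1: 250 and 600
print(out2)  # empty: nothing in Value2 is beyond the whiskers
```

Note that 200 and 300 in Value2 are not flagged, because the spread of that column is already wide enough to contain them.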

My question is: how can I replace the outliers with another value (e.g., mean, median, capped value, etc.)? Any help would be appreciated!

Instead of NA, you could replace the value with the mean or the median, whichever you prefer.

library(dplyr)
library(purrr)

replaceOuts = function(df) {
   map_if(df, is.numeric, 
          ~ replace(.x, .x %in% boxplot.stats(.x)$out, mean(.x))) %>%
   bind_cols 
}

replaceOuts(df)

# RowID Value1 Value2
#  <dbl>  <dbl>  <dbl>
#1     1     6       1
#2     2     2     200
#3     3   100       3
#4     4     1       4
#5     5   108.      5
#6     6     2       6
#7     7     8     300
#8     8   108.    300
#9     9     2       9

Replace mean with median or any other function that you want.
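
One thing to be aware of: `mean(.x)` above is computed over the whole column, outliers included, which is why the replacement comes out at about 108. If the replacement should reflect only the non-outlying values, something along these lines might work. This is a sketch in base R; the helper name `replace_outliers` and the data frame and column names are assumptions for illustration:

```r
# Hypothetical helper: replace boxplot outliers with a statistic
# computed from the remaining (non-outlier) values only.
replace_outliers <- function(x, fun = mean) {
  is_out <- x %in% boxplot.stats(x)$out
  replace(x, is_out, fun(x[!is_out], na.rm = TRUE))
}

# Sample data from the question; the names df and RowID are assumed.
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Apply the helper to every numeric column, here with the median.
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols],
                       function(col) replace_outliers(col, fun = median))
```

With the median, the two flagged values in Value1 are replaced by 2, the median of the seven remaining values, rather than by a value pulled upward by 250 and 600.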

PS - I think it is better to use mutate_if instead of map_if here, since it avoids the bind_cols at the end.

df %>% mutate_if(is.numeric, ~replace(., . %in% boxplot.stats(.)$out, mean(.)))
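
In dplyr 1.0.0 and later, the scoped verbs such as `mutate_if()` are superseded by `across()`, so the same idea can also be written as below. This is a sketch that swaps in the median for the mean; the data frame and column names are assumptions for illustration:

```r
library(dplyr)

# Sample data from the question; the names df and RowID are assumed.
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# For every numeric column, overwrite boxplot outliers with the
# median of that column (the median of the full column, outliers included).
result <- df %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, .x %in% boxplot.stats(.x)$out, median(.x))))
```

Here the two flagged values in Value1 become 6, the median of the full column.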

I think you need minVal and maxVal threshold values. Then replace the values outside the range (minVal, maxVal) with whatever value you put in myValue (mean, median, or whatever you need).

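# The limits could be any values; here the boxplot whisker ends are used.
minVal <- boxplot.stats(data$columnX)$stats[1]  # lower whisker
maxVal <- boxplot.stats(data$columnX)$stats[5]  # upper whisker
myValue <- median(data$columnX)

# Overwrite every value outside [minVal, maxVal] with the replacement.
data[data$columnX < minVal | data$columnX > maxVal, "columnX"] <- myValue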
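
The same thresholds can be used to cap values instead of replacing them with a central value, and the step can be applied to every numeric column at once. A minimal sketch in base R, assuming the sample data from the question; the helper name `cap_outliers` and the data frame and column names are hypothetical:

```r
# Hypothetical helper: clamp values to the boxplot whisker ends.
cap_outliers <- function(x) {
  s <- boxplot.stats(x)$stats   # s[1] = lower whisker, s[5] = upper whisker
  pmin(pmax(x, s[1]), s[5])     # NA values stay NA
}

# Sample data from the question; the names df and RowID are assumed.
df <- data.frame(
  RowID  = 1:9,
  Value1 = c(6, 2, 100, 1, 250, 2, 8, 600, 2),
  Value2 = c(1, 200, 3, 4, 5, 6, 300, 300, 9)
)

# Cap every numeric column in place.
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], cap_outliers)
```

For Value1 the upper whisker is 100, so 250 and 600 are both pulled down to 100, while Value2 is left unchanged.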
