Removing outliers from individual cells in R

Question

I am a newbie in R and I am stuck on a problem with removing some outliers. I have a data frame that looks something like this:

Item1   Item2   Item3
 4.05    3.9   3.6
 12      3.7   4
 4.01    3.8   4

My desired result is something like the table below, where the outliers in each column are removed (replaced with NA):

Item1  Item2  Item3 
4.05    3.9    3.6
NA      3.7    4
4.01    3.8    4

So far I have written code that can detect the outliers, but I am stuck on removing them: the entire column changes instead of the single value.

find_outlier <- function(log_reaction_time) {
  media <- mean(log_reaction_time)
  devst <- sd(log_reaction_time)
  result <- which(log_reaction_time < media - 2 * devst | log_reaction_time > media + 2 * devst)
  log_reaction_time2 <- ifelse(log_reaction_time %in% result, NA, log_reaction_time)
}
apply(log_reaction_time, 2, find_outlier)

I guess the problem comes from the fact that I apply the function over the columns (2), since I want to find the outliers of each column, but then I want to remove only the relevant values...

Answer 1

We will use the same dataset to show this:

#Data
df1 <- structure(list(Item1 = c(4.05, 12, 4.01), Item2 = c(3.9, 3.7, 
3.8), Item3 = c(3.6, 4, 4)), class = "data.frame", row.names = c(NA, 
-3L))

df1
  Item1 Item2 Item3
1  4.05   3.9   3.6
2 12.00   3.7   4.0
3  4.01   3.8   4.0

Now the function:

#Function
find_outlier <- function(log_reaction_time) {
  media <- mean(log_reaction_time)
  devst <- sd(log_reaction_time)
  result <-which(log_reaction_time < media - 2 * devst | log_reaction_time > media + 2 * devst)
  log_reaction_time[result] <- NA
  return(log_reaction_time)
}

apply(df1, 2, find_outlier)

     Item1 Item2 Item3
[1,]  4.05   3.9   3.6
[2,] 12.00   3.7   4.0
[3,]  4.01   3.8   4.0
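The function above differs from the question's version in one step: it assigns NA by position instead of testing values with %in%. Because which() returns positions, `log_reaction_time %in% result` asks whether a value equals a position number, which is rarely what is intended. The sketch below illustrates the difference; the flagged position `idx` is made up for illustration and is not what which() returns for this data.

```r
x <- c(4.05, 12, 4.01)
idx <- 2L  # hypothetical: pretend position 2 was flagged as an outlier

# Question's approach: compares each VALUE in x with the POSITION 2.
# No value in x equals 2, so nothing is replaced.
by_value <- ifelse(x %in% idx, NA, x)

# Answer's approach: index by position, so only element 2 becomes NA.
by_index <- x
by_index[idx] <- NA

by_value
by_index
```

Indexing by position also leaves every other element of the column untouched.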

Note that the second value of Item1 is not set to NA. For that column, mean(df1$Item1) = 6.69 and sd(df1$Item1) = 4.60, so the limits are mean(df1$Item1) - 2 * sd(df1$Item1) = -2.52 and mean(df1$Item1) + 2 * sd(df1$Item1) = 15.89. Since 12 lies inside those limits, it is not flagged as an outlier. You will have to define a different criterion for it to be assigned NA.
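With only three values per column, the 2-standard-deviation rule can never flag anything: when sd() uses the n - 1 denominator, no observation can lie more than (n - 1) / sqrt(n) standard deviations from the mean, which is about 1.15 for n = 3. As one possible "other criterion", the sketch below uses the median and the median absolute deviation (MAD). The function name `flag_mad` and the cutoff of 3 are assumptions chosen for illustration, not something the original answer proposes.

```r
df1 <- data.frame(Item1 = c(4.05, 12, 4.01),
                  Item2 = c(3.9, 3.7, 3.8),
                  Item3 = c(3.6, 4, 4))

# Upper bound on |z| for a sample of size n when sd() divides by n - 1
n <- nrow(df1)
max_abs_z <- (n - 1) / sqrt(n)  # about 1.15 for n = 3, so a cutoff of 2 is unreachable

# Hypothetical alternative: flag values more than `cutoff` MADs from the median.
flag_mad <- function(x, cutoff = 3) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)   # scaled median absolute deviation
  if (spread == 0) return(x)       # no usable spread estimate: leave the column unchanged
  x[which(abs(x - centre) > cutoff * spread)] <- NA
  x
}

cleaned <- df1
cleaned[] <- lapply(df1, flag_mad)  # `[] <-` keeps the data.frame structure
cleaned
```

On this data only the 12 in Item1 is replaced. The guard for a zero MAD matters here because two of the three Item3 values are identical; without it, 3.6 would be flagged as well.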

Answer 2

Not quite sure which you want, but here is a tidyverse solution for either case: thresholds computed separately for each column, or thresholds computed once from all values pooled together.


library(dplyr)

# Option 1: mean and sd are computed separately within each column
df %>% 
  mutate_all(function(x) ifelse(x < mean(x) - 2 * sd(x) | x > mean(x) + 2 * sd(x) , 
                                NA_real_, 
                                x))
#> # A tibble: 3 x 3
#>   Item1 Item2 Item3
#>   <dbl> <dbl> <dbl>
#> 1  4.05   3.9   3.6
#> 2 12      3.7   4  
#> 3  4.01   3.8   4

# Option 2: mean and sd are computed once from all values in the data frame
media <- mean(as.matrix(df))
devst <- sd(as.matrix(df))

df %>% 
  mutate_all(function(x) ifelse(x < media - 2 * devst | x > media + 2 * devst , 
                                NA_real_, 
                                x))
#> # A tibble: 3 x 3
#>   Item1 Item2 Item3
#>   <dbl> <dbl> <dbl>
#> 1  4.05   3.9   3.6
#> 2 NA      3.7   4  
#> 3  4.01   3.8   4
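To see why the pooled version flags the 12 while the per-column version does not, it helps to look at the pooled limits directly. The following is a base R sketch of the same idea, not the answer's own code; the names `vals`, `lower` and `upper` are introduced here for illustration.

```r
df <- data.frame(Item1 = c(4.05, 12, 4.01),
                 Item2 = c(3.9, 3.7, 3.8),
                 Item3 = c(3.6, 4, 4))

vals  <- unlist(df)       # all nine values pooled into one vector
media <- mean(vals)       # roughly 4.78
devst <- sd(vals)         # roughly 2.71

lower <- media - 2 * devst
upper <- media + 2 * devst  # roughly 10.2, so 12 falls outside

# A logical matrix index replaces only the offending cells
df[df < lower | df > upper] <- NA
df
```

Pooling gives nine observations instead of three, so the single large value no longer dominates the standard deviation to the same degree.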

Your data

library(readr)
df <- read_table("Item1   Item2   Item3
4.05    3.9   3.6
12      3.7   4
4.01    3.8   4")

Answer 3

Using dplyr, if df is the first data.frame in your post, the following should work:

library(dplyr)
df %>%
  mutate(across(everything(), find_outlier)) -> new_df
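For readers without dplyr, a base R sketch of the same column-wise application is shown below. It assumes `find_outlier` is the version from Answer 1 that returns the modified vector. Unlike apply(), which returns a matrix, assigning into `new_df[]` keeps the result a data.frame.

```r
# find_outlier as defined in Answer 1
find_outlier <- function(log_reaction_time) {
  media <- mean(log_reaction_time)
  devst <- sd(log_reaction_time)
  result <- which(log_reaction_time < media - 2 * devst |
                    log_reaction_time > media + 2 * devst)
  log_reaction_time[result] <- NA
  return(log_reaction_time)
}

df <- data.frame(Item1 = c(4.05, 12, 4.01),
                 Item2 = c(3.9, 3.7, 3.8),
                 Item3 = c(3.6, 4, 4))

new_df <- df
new_df[] <- lapply(df, find_outlier)  # apply per column, keep data.frame structure

class(apply(df, 2, find_outlier))     # matrix/array, as in Answer 1's output
class(new_df)                         # data.frame
```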

3 answers

Answer 1: 1, accepted, 2020-07-14 15:17:14
Answer 2: 0, 2020-07-14 15:16:22
Answer 3: 0, 2020-07-14 15:16:51
