How to remove outliers (values more than 3 standard deviations from the mean) from each column of a data frame

Question

I have a dataset with participant IDs and 17 different measures for each participant.

I need to remove outliers: values more than 3 standard deviations away from the mean, on either side. This needs to happen for each column individually.

So far, using the code below, I have managed to add a separate outlier-flag column for each measure, but that doesn't help me much. I need to either replace the outlier with NA in the original column (alongside the rest of the numbers) or simply remove the outlier value.

Ideally I want to get to a file that looks like this:

ID measure1 measure2 ....measure17
1  10897                  64436
2  184658    1739473
3            75758
4  746483    4327349      3612638
5  6444      36363        46447

Code I have used so far:

phenotypes <- colnames(imaging_data_kept[,2:ncol(imaging_data_kept)])

 for (i in phenotypes){
  Min <- mean(imaging_data_kept[[i]]) - (3*sd(imaging_data_kept[[i]]))
  Max <- mean(imaging_data_kept[[i]]) + (3*sd(imaging_data_kept[[i]]))  
  imaging_data_kept[[paste0(i,"_outliers")]] <- imaging_data_kept[[i]] < 
  Min | imaging_data_kept[[i]] > Max
 }

Sample data:

SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 4960.2 
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 3943.5 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1160.4 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1

Answer 1

This will replace values more than 3 SD from the mean with NA:

dd[,-1] <- lapply(dd[,-1],
      function(x) replace(x,abs(scale(x))>3,NA))

(The scale() function computes (x-mean(x))/sd(x); abs(scale(x))>3 flags the values whose standardized score is more than 3 in either direction; replace() replaces a specified set of indices with the indicated value.)
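To make the parenthetical concrete, here is a minimal sketch on a made-up vector (the values and the names `z_manual`, `z_scale` and `cleaned` are illustrative, not from the question). It checks that scale() matches the manual z-score and shows what replace() does with the resulting logical index.

```r
# Made-up data: 15 ordinary values and one obvious outlier at the end
x <- c(10, 12, 11, 9, 10, 11, 12, 9, 10, 11, 10, 12, 9, 11, 10, 1000)

# Manual z-score versus scale(); scale() returns a one-column matrix,
# so as.vector() is used to compare like with like
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.vector(scale(x))
all.equal(z_manual, z_scale)   # TRUE

# replace() sets the flagged positions to NA and leaves the rest untouched
cleaned <- replace(x, abs(z_scale) > 3, NA)
cleaned
```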

You can then use na.omit(dd) if you want to drop all rows that contain outliers in any column.

The sample data you gave us doesn't appear to have any outliers (according to your definition), so I added some.

dd <- read.table(header=TRUE,
                 colClasses=c("character",rep("numeric",5)),
                 text="
SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 100000
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 100000 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1000000 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
")
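As a rough end-to-end sketch of this answer, the snippet below runs the same lapply() call on a smaller, fully made-up data frame (`dd_demo` and its values are hypothetical) and then applies na.omit(). It is only meant to show how many cells become NA and how many rows survive.

```r
# Hypothetical data: 20 subjects, two measures, one planted outlier in each
dd_demo <- data.frame(
  SubjID = as.character(1001:1020),
  M1 = rep(c(9, 10, 11, 12), 5),
  M2 = rep(c(50, 52, 54, 56, 58), 4),
  stringsAsFactors = FALSE
)
dd_demo$M1[3] <- 500
dd_demo$M2[7] <- -900

# Same approach as above: skip the ID column, mask |z| > 3 with NA
dd_demo[, -1] <- lapply(dd_demo[, -1],
                        function(x) replace(x, abs(scale(x)) > 3, NA))

# Number of masked values per measure
colSums(is.na(dd_demo[, -1]))

# Optional: drop every row that lost a value in any column
complete <- na.omit(dd_demo)
nrow(complete)
```

Keeping the NA values (rather than calling na.omit()) matches the layout asked for in the question, where each measure keeps its own gaps.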

Answer 2

I recommend using the boxplot() function, which calculates outliers. You can access them in your boxplot object via boxplot$out, or get the quantiles via boxplot$stats, which is what I'm doing next.

But beware that boxplot() does not define outliers in terms of 3 standard deviations; it uses Q1 - 1.5*IQR and Q3 + 1.5*IQR as the lower and upper limits respectively.
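A small sketch of that rule on a single made-up vector, using base R's boxplot.stats(), which returns the same stats and out components without drawing a plot. The vector `x` is illustrative only.

```r
# Made-up values: a tight cluster plus two low and two high extremes
x <- c(-20.32, -15.29, 1.00, 1.70, 1.88, 1.23, 0.27, 11.23, 20.45)

bs <- boxplot.stats(x)   # coef = 1.5 is the default multiplier of the IQR

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# out: the values lying beyond the whiskers
bs$out
```

Note that the hinges used here can differ slightly from the quartiles returned by quantile(), depending on the sample size.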


library(dplyr) # for the pipe operator (%>%) and between()

#creating sample data 
df <- data.frame("var1" = c(-20.32, -15.29, rnorm(5,1,1), 11.23, 20.45),
                 "var2" = c(-12.43, -3.12, rnorm(5, 1,1), 10.75, 18.11))

#looks like that
> df
         var1        var2
1 -20.3200000 -12.4300000
2 -15.2900000  -3.1200000
3   0.9950276   1.2645415
4   1.7022687   0.8313770
5   1.8828154  -0.7459769
6   1.2299670   0.5053378
7   0.2749259   2.0239793
8  11.2300000  10.7500000
9  20.4500000  18.1100000

#remove outliers
nooutliers <- lapply(df, function(x) boxplot(x, plot = FALSE)) %>%
                lapply(`[`, "stats") %>% 
                  lapply(range) %>%
                    mapply(function (x,y) !between(x, y[1], y[2]), df, .) %>%
                      as.data.frame %>%
                        mapply(function(x,y) {y[x] <- NA; y},  
                               y = df, x = .)

#looks like this now
> nooutliers
           var1       var2
 [1,]        NA         NA
 [2,]        NA -3.1200000
 [3,] 0.9950276  1.2645415
 [4,] 1.7022687  0.8313770
 [5,] 1.8828154 -0.7459769
 [6,] 1.2299670  0.5053378
 [7,] 0.2749259  2.0239793
 [8,]        NA         NA
 [9,]        NA         NA

This code calculates the range within the whiskers for each column, assigns NA to all values outside of this range and returns a matrix.
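If a data frame is preferred over a matrix, one possible alternative is sketched below. It assumes base R's boxplot.stats() for the whisker range; the helper name `mask_whisker_outliers` and the data in `df_demo` are made up for illustration.

```r
# Hypothetical data with fixed values so the result is reproducible
df_demo <- data.frame(
  var1 = c(-20.32, -15.29, 1.00, 1.70, 1.88, 1.23, 0.27, 11.23, 20.45),
  var2 = c(-12.43, -3.12, 1.26, 0.83, -0.75, 0.51, 2.02, 10.75, 18.11)
)

# Set everything outside the whisker range of a single column to NA
mask_whisker_outliers <- function(x) {
  whiskers <- boxplot.stats(x)$stats[c(1, 5)]
  replace(x, x < whiskers[1] | x > whiskers[2], NA)
}

# Assigning into cleaned[] keeps the data frame structure and column names
cleaned <- df_demo
cleaned[] <- lapply(df_demo, mask_whisker_outliers)
cleaned
```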

I suppose this is what you're looking for.

UPDATE: With 3 standard deviations:

df <- data.frame("var1" = c(-210.32, rnorm(20,1,1), 234.45),
                 "var2" = c(-230.43, rnorm(20, 1,1), 213.11))


phenotypes <- colnames(df)

for (i in phenotypes){
  Min <- mean(df[[i]]) - (3*sd(df[[i]]))
  Max <- mean(df[[i]]) + (3*sd(df[[i]]))  
  df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA}

This adopts your outlier definition.
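One caveat with the loop: mean() and sd() return NA if a column already contains missing values, in which case nothing gets flagged. A possible variant, sketched here under the assumption that existing NAs should simply be ignored, wraps the same rule in a small function (the name `mask_sd_outliers`, its parameter `k` and the data are hypothetical).

```r
# Mask values more than k standard deviations from the column mean,
# ignoring any NAs that are already present
mask_sd_outliers <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  # which() drops the NA comparisons produced by missing values
  replace(x, which(abs(x - centre) > k * spread), NA)
}

# Made-up data: one planted outlier and one pre-existing NA per column
df_demo <- data.frame(
  var1 = rep(c(9, 10, 11, 12), 5),
  var2 = rep(c(50, 52, 54, 56, 58), 4)
)
df_demo$var1[3] <- 500
df_demo$var1[5] <- NA
df_demo$var2[7] <- -900
df_demo$var2[9] <- NA

df_demo[] <- lapply(df_demo, mask_sd_outliers)
colSums(is.na(df_demo))
```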

Answer 1
4 2019-03-28 13:22:56

Answer 2
3 2019-03-25 14:00:02
