
How can I remove outliers (numbers 3 standard deviations away from the mean) in each column of a data frame?

I have a dataset with participant IDs and 17 different measures for each participant.

I need to remove outliers: values more than 3 standard deviations from the mean, on either side. This needs to happen for each column individually.

So far, using the code below, I have managed to flag outliers in a separate outlier column for each measure, but that doesn't help me much: I need to either put NA in place of each outlier in the original column, or simply remove the outlier value.

Ideally I want to end up with a file that looks like this:

ID measure1 measure2 ....measure17
1  10897                  64436
2  184658    1739473
3            75758
4  746483    4327349      3612638
5  6444      36363        46447

Code I have used so far:

phenotypes <- colnames(imaging_data_kept[,2:ncol(imaging_data_kept)])

for (i in phenotypes){
  Min <- mean(imaging_data_kept[[i]]) - (3*sd(imaging_data_kept[[i]]))
  Max <- mean(imaging_data_kept[[i]]) + (3*sd(imaging_data_kept[[i]]))
  imaging_data_kept[[paste0(i,"_outliers")]] <-
    imaging_data_kept[[i]] < Min | imaging_data_kept[[i]] > Max
}

Sample data:

SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 4960.2 
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 3943.5 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1160.4 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1 

This will replace values more than 3 SD from the mean with NA:

dd[,-1] <- lapply(dd[,-1],
      function(x) replace(x,abs(scale(x))>3,NA))

(The scale() function computes (x-mean(x))/sd(x); abs(scale(x))>3 should be reasonably self-explanatory; replace() replaces a specified set of indices with the indicated value.)
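As a minimal sketch of what those three functions do on a single vector (the vector `x` below is made up for illustration and is not from the question's data):

```r
# Hypothetical vector: twenty ordinary values and one extreme value.
x <- c(1:20, 500)

# scale() returns a one-column matrix; as.vector() drops the matrix structure.
z <- as.vector(scale(x))

# The same z-scores computed by hand, for comparison.
z_manual <- (x - mean(x)) / sd(x)
all.equal(z, z_manual)

# Positions that lie more than 3 SD from the mean.
flagged <- which(abs(z) > 3)

# replace() returns a copy of x with the flagged positions set to NA.
x_clean <- replace(x, abs(z) > 3, NA)
x_clean
```

Only the last element (500) should end up as NA here; the other twenty values stay well within 3 SD of the mean.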

You can then use na.omit(dd) if you want to drop all rows that contain outliers in any column.
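A small sketch of the difference between blanking outliers and dropping whole rows, using a made-up data frame (`toy`, with hypothetical columns `SubjID`, `M1` and `M2`) rather than the question's data:

```r
# Hypothetical data: 21 subjects, two measures, one extreme value per measure.
toy <- data.frame(
  SubjID = sprintf("S%02d", 1:21),
  M1 = c(1:20, 500),
  M2 = c(900, 21:40)
)

# Column-wise replacement, skipping the ID column.
toy[, -1] <- lapply(toy[, -1],
                    function(x) replace(x, abs(scale(x)) > 3, NA))

# Number of NAs introduced in each column.
colSums(is.na(toy))

# na.omit() drops every row that has an NA in any column.
complete <- na.omit(toy)
nrow(complete)
```

Because each column is screened independently, na.omit() removes a subject as soon as any one measure is flagged. With 17 measures that can discard many rows, so keeping the NA values may be preferable if later analyses can handle them.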

The sample data you gave us doesn't appear to have any outliers (according to your definition), so I added some.


dd <- read.table(header=TRUE,
                 colClasses=c("character",rep("numeric",5)),
                 text="
SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 100000
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 100000 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1000000 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
")

I recommend using the boxplot() function, which calculates outliers. You can access them in the object returned by boxplot() via $out, or get the whisker and quartile statistics via $stats, which is what I do next.

But beware that boxplot() does not define outliers in terms of 3 standard deviations; it flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
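To see how much the two definitions can differ, here is a rough sketch on one made-up vector (the values are hypothetical, chosen to resemble a tight cluster with four distant points):

```r
# Hypothetical values: a tight cluster plus four distant points.
x <- c(-20.32, -15.29, 1.0, 1.7, 1.9, 1.2, 0.3, 11.23, 20.45)

# plot = FALSE returns the boxplot statistics without drawing anything.
bp <- boxplot(x, plot = FALSE)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values beyond the whiskers (the 1.5 * IQR rule)

# The 3 SD rule applied to the same vector.
sd_out <- x[abs(x - mean(x)) > 3 * sd(x)]
sd_out
```

On this vector the boxplot rule flags the four distant points, while the 3 SD rule flags none of them, because the distant points themselves inflate the standard deviation.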


library(dplyr) # for the pipe operator %>% and between()

#creating sample data 
df <- data.frame("var1" = c(-20.32, -15.29, rnorm(5,1,1), 11.23, 20.45),
                 "var2" = c(-12.43, -3.12, rnorm(5, 1,1), 10.75, 18.11))

#looks like that
> df
         var1        var2
1 -20.3200000 -12.4300000
2 -15.2900000  -3.1200000
3   0.9950276   1.2645415
4   1.7022687   0.8313770
5   1.8828154  -0.7459769
6   1.2299670   0.5053378
7   0.2749259   2.0239793
8  11.2300000  10.7500000
9  20.4500000  18.1100000

#remove outliers (boxplot(x, ...) so that each column gets its own whisker range)
nooutliers <- lapply(df, function(x) boxplot(x, plot = FALSE)) %>%
                lapply(`[`, "stats") %>% 
                  lapply(range) %>%
                    mapply(function (x,y) !between(x, y[1], y[2]), df, .) %>%
                      as.data.frame %>%
                        mapply(function(x,y) {y[x] <- NA; y},  
                               y = df, x = .)

#looks like this now
> nooutliers
           var1       var2
 [1,]        NA         NA
 [2,]        NA -3.1200000
 [3,] 0.9950276  1.2645415
 [4,] 1.7022687  0.8313770
 [5,] 1.8828154 -0.7459769
 [6,] 1.2299670  0.5053378
 [7,] 0.2749259  2.0239793
 [8,]        NA         NA
 [9,]        NA         NA

This code calculates the range within the whiskers for each column, assigns NA to all values outside of this range and returns a matrix.
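If a data frame is preferred over a matrix, the same idea can be sketched with a small helper. The function name `na_outside_whiskers` and the data values are hypothetical; boxplot.stats() computes the whisker positions without drawing a plot:

```r
# Hypothetical helper: set values outside the boxplot whiskers to NA.
na_outside_whiskers <- function(x) {
  # Elements 1 and 5 of $stats are the lower and upper whisker ends.
  w <- boxplot.stats(x)$stats[c(1, 5)]
  x[which(x < w[1] | x > w[2])] <- NA
  x
}

# Made-up data with the same shape as the example above.
df <- data.frame(
  var1 = c(-20.32, -15.29, 1.0, 1.7, 1.9, 1.2, 0.3, 11.23, 20.45),
  var2 = c(-12.43, -3.12, 1.3, 0.8, -0.7, 0.5, 2.0, 10.75, 18.11)
)

# Assigning into cleaned[] keeps the data frame structure and column names.
cleaned <- df
cleaned[] <- lapply(df, na_outside_whiskers)
cleaned
```

Each column is screened against its own whisker range, so -3.12 survives in var2 even though -15.29 is removed from var1.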

I suppose this is what you're looking for.

UPDATE: With 3 standard deviations:

df <- data.frame("var1" = c(-210.32, rnorm(20,1,1), 234.45),
                 "var2" = c(-230.43, rnorm(20, 1,1), 213.11))


phenotypes <- colnames(df)

for (i in phenotypes){
  Min <- mean(df[[i]]) - (3*sd(df[[i]]))
  Max <- mean(df[[i]]) + (3*sd(df[[i]]))  
  df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA
}

This adopts your outlier definition.
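One caveat with the loop: mean() and sd() return NA if a column already contains missing values, so nothing would be flagged in that column. A hedged sketch of a reusable variant follows; the helper name `na_beyond_sd`, its `k` argument and the data are all assumptions made for illustration:

```r
# Hypothetical helper: set values more than k SDs from the mean to NA.
# na.rm = TRUE ignores values that are already missing.
na_beyond_sd <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  x[which(abs(x - m) > k * s)] <- NA
  x
}

# Made-up data: two extreme values per column, plus one pre-existing NA in var1.
df <- data.frame(
  var1 = c(-210.32, rep(c(0, 2), 10), 234.45, NA),
  var2 = c(-230.43, rep(c(0, 2), 10), 213.11, 5)
)

# Apply the helper to every column while keeping the data frame structure.
df[] <- lapply(df, na_beyond_sd)
colSums(is.na(df))
```

In this sketch var1 should end with three NA values (the two extremes plus the one that was already missing) and var2 with two.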

