[英]How can I remove outliers (numbers 3 standard deviations away from the mean) in each column of a data frame
I have a dataset with participant IDS and 17 different measures for each participant.我有一个包含参与者 IDS 的数据集和每个参与者的 17 种不同度量。
I need to remove outliers- numbers that are 3 standard deviations away from the mean both sides.我需要删除离均值 3 个标准差的异常值。 This needs to happen for each column individually.这需要为每一列单独发生。
So far, by using the code below I have managed to add NA to an outlier column for each column, but it doesn't help me much, since I need to be able to either add NA to the column with the rest of the numbers or simply remove the outlier number到目前为止,通过使用下面的代码,我已经设法将 NA 添加到每列的异常值列中,但这对我帮助不大,因为我需要能够将 NA 添加到带有其余数字的列中或者简单地删除异常值
Ideally I want to get to a file that looks like this:理想情况下,我想获得一个如下所示的文件:
ID measure1 measure2 ....measure17
1 10897 64436
2 184658 1739473
3 75758
4 746483 4327349 3612638
5 6444 36363 46447
Code I have used so far:到目前为止我使用过的代码:
phenotypes <- colnames(imaging_data_kept[,2:ncol(imaging_data_kept)])
for (i in phenotypes){
Min <- mean(imaging_data_kept[[i]]) - (3*sd(imaging_data_kept[[i]]))
Max <- mean(imaging_data_kept[[i]]) + (3*sd(imaging_data_kept[[i]]))
imaging_data_kept[[paste0(i,"_outliers")]] <- imaging_data_kept[[i]] <
Min | imaging_data_kept[[i]] > Max
}
Sample data:样本数据:
SubjID M1 M2 M3 M4 M5
1000496 14898.1 9172 4902 5921.9 1428.2
1001121 5420.7 2855.5 4144 732.1 4960.2
1001468 7478.8 3401.4 5143.6 1106.5 4355.5
1004960 11316.4 8460.1 3953.4 5682.2 1717
1005040 15052.7 6362.8 3145.2 4593 1214.5
1005677 17883.3 6705.1 3943.5 4993.1 1373.1
1006128 6260.8 4274.6 5865 2002.3 4727.1
1006694 9292.8 3389.9 5141.6 1246.6 4135.7
1009080 10391.3 8372.1 2921.8 4008.6 860.4
1010482 9381.5 2743.4 4526.5 1160.4 3655.1
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
This will replace values more than 3 SD from the mean with NA:这将用 NA 替换平均值超过 3 SD 的值:
dd[,-1] <- lapply(dd[,-1],
function(x) replace(x,abs(scale(x))>3,NA))
(The scale()
function computes (x-mean(x))/sd(x)
; abs(scale(x))>3
should be reasonably self-explanatory; replace()
replaces a specified set of indices with the indicated value.) ( scale()
函数计算(x-mean(x))/sd(x)
; abs(scale(x))>3
应该是合理的不言自明; replace()
用指定的值替换指定的一组索引.)
You can then use na.omit(dd)
if you want to drop all rows that contain outliers in any column.如果要删除任何列中包含异常值的所有行,则可以使用na.omit(dd)
。
The sample data you gave us doesn't appear to have any outliers (according to your definition) -- I added some.您提供给我们的样本数据似乎没有任何异常值(根据您的定义)——我添加了一些。
dd <- read.table(header=TRUE,
colClasses=c("character",rep("numeric",5)),
text="
SubjID M1 M2 M3 M4 M5
1000496 14898.1 9172 4902 5921.9 1428.2
1001121 5420.7 2855.5 4144 732.1 100000
1001468 7478.8 3401.4 5143.6 1106.5 4355.5
1004960 11316.4 8460.1 3953.4 5682.2 1717
1005040 15052.7 6362.8 3145.2 4593 1214.5
1005677 17883.3 6705.1 100000 4993.1 1373.1
1006128 6260.8 4274.6 5865 2002.3 4727.1
1006694 9292.8 3389.9 5141.6 1246.6 4135.7
1009080 10391.3 8372.1 2921.8 4008.6 860.4
1010482 9381.5 2743.4 4526.5 1000000 3655.1
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
")
I recommend using the boxplot()
- function, which calculates outliers.我建议使用boxplot()
- 函数,它计算异常值。 You can acces them in your boxplot
-object via boxplot$out
or get the quantiles via boxplot$stats
.您可以通过boxplot$out
在您的boxplot
-object 中访问它们或通过boxplot$stats
获取分位数。 Which is what I'm doing next.这就是我接下来要做的。
But beware that boxplot does not calculate outliers in terms of 3 standard deviations but with Q1 - 1.5*IQR
and Q3 + 1.5*IQR
respectively.但请注意,箱线图不会根据 3 个标准差计算异常值,而是分别使用Q1 - 1.5*IQR
和Q3 + 1.5*IQR
。
library(dplyr) # for the pipe operators
#creating sample data
df <- data.frame("var1" = c(-20.32, -15.29, rnorm(5,1,1), 11.23, 20.45),
"var2" = c(-12.43, -3.12, rnorm(5, 1,1), 10.75, 18.11))
#looks like that
> df
var1 var2
1 -20.3200000 -12.4300000
2 -15.2900000 -3.1200000
3 0.9950276 1.2645415
4 1.7022687 0.8313770
5 1.8828154 -0.7459769
6 1.2299670 0.5053378
7 0.2749259 2.0239793
8 11.2300000 10.7500000
9 20.4500000 18.1100000
#remove outliers
nooutliers <- lapply(df, function(x) boxplot(df, plot = FALSE)) %>%
lapply(`[`, "stats") %>%
lapply(range) %>%
mapply(function (x,y) !between(x, y[1], y[2]), df, .) %>%
as.data.frame %>%
mapply(function(x,y) {y[x] <- NA; y},
y = df, x = .)
#looks like this now
> nooutliers
var1 var2
[1,] NA NA
[2,] NA -3.1200000
[3,] 0.9950276 1.2645415
[4,] 1.7022687 0.8313770
[5,] 1.8828154 -0.7459769
[6,] 1.2299670 0.5053378
[7,] 0.2749259 2.0239793
[8,] NA NA
[9,] NA NA
This code calculates the range within the whiskers for each column, assigns NA
to all values outside of this range and returns a matrix.此代码计算每列胡须内的范围,将NA
分配给此范围之外的所有值并返回一个矩阵。
I suppose this is what you're looking for.我想这就是你要找的。
UPDATE: With 3 standard deviations:更新:有 3 个标准差:
df <- data.frame("var1" = c(-210.32, rnorm(20,1,1), 234.45),
"var2" = c(-230.43, rnorm(20, 1,1), 213.11))
phenotypes <- colnames(df)
for (i in phenotypes){
Min <- mean(df[[i]]) - (3*sd(df[[i]]))
Max <- mean(df[[i]]) + (3*sd(df[[i]]))
df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA}
This adopts your outlier definition.这采用了您的异常值定义。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.