簡體   English   中英

檢測具有不同組長度的分組 data.frame 列中的異常值

[英]Detect outliers in grouped data.frame columns with varying group length

我正在嘗試平均 R 中沉積物粒度等級的測量數據。我通常每個樣品有 8 次測量,但有時設備失敗或測量不可靠(例如,由於 3 次測量后的機械問題等)。 因此,此測量值看起來與同一樣本的其余部分非常不同,因此我認為異常值測試可以自動刪除它們。

我嘗試使用dplyr遵循此代碼:

https://www.r-bloggers.com/combined-outlier-detection-with-dplyr-and-ruler/

但似乎不是所有組的長度都相同的問題。 我還發現了這種針對向量的方法

https://stackoverflow.com/a/4788102/8321705

但我不知道如何事先對我的數據進行分組,然后將其應用於每組的每一列。

這是我的一個數據頭,少數樣本只有3次重復測量。 前 3 列數字以一種方式描述粒子分數,后 6 列以另一種方式描述:

#my data with unequal group length
test <- structure(list(Sample_name = c("Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1140", "Sediment1140", 
"Sediment1140", "Sediment1140", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1141", "Sediment1141", 
"Sediment1141", "Sediment1141", "Sediment1142", "Sediment1142", 
"Sediment1142", "Sediment1142"), Dx_10_percent = c(228.3413627, 
232.9637155, 236.4058197, 235.4124387, 238.2260309, 238.983854, 
237.0509773, 234.22402, 245.5622443, 247.1072046, 248.7302949, 
247.7311716, 253.6328878, 249.7883614, 245.8217667, 247.047291, 
183.5981354, 186.4353531, 184.4024079, 183.9496282), Dx_50_percent = c(464.4559559, 
470.4392019, 479.0066087, 474.75933, 478.1515348, 481.8823096, 
480.3117339, 476.2827332, 442.6699831, 443.3890093, 442.4344575, 
435.0531805, 443.4543899, 447.494161, 434.7639443, 433.3472111, 
336.6085081, 340.9353695, 336.0106474, 340.0936298), Dx_90_percent = c(854.6392436, 
856.8504381, 879.5524457, 880.468129, 858.3297603, 905.0097741, 
879.5146896, 873.8584305, 819.4818726, 816.5296812, 778.9013718, 
766.7617116, 770.5702479, 829.0866972, 766.083991, 751.9915196, 
656.4105245, 664.7034131, 698.6157344, 718.225128), Microm_001 = c(0.797059348, 
0.801571015, 0.734207569, 0.841152063, 0.834553976, 0.75429789, 
0.831636299, 1.000633239, 0.713401217, 0.74354612, 0.741372753, 
0.841747801, 0.727424775, 0.755804532, 1.163420288, 1.081749441, 
0.27579194, 0.483475909, 0.555629788, 0.697398689), Microm_63 = c(0.90472056, 
0.944959738, 0.94659555, 0.903114644, 0.96501726, 0.91079578, 
0.954569594, 0.987258593, 0.000822487, 0.000571334, 0.000442701, 
0.000297749, 0, 0.000259136, 0.000891928, 0.000769421, 0.923900573, 
0.793744127, 0.809342888, 0.839719189), Microm_125 = c(11.30751247, 
10.58007103, 10.18149105, 10.27507954, 9.833963719, 9.901098909, 
9.983293752, 10.11735892, 10.0938523, 9.776308321, 9.483238809, 
9.57495155, 8.647488515, 9.280930949, 9.601526818, 9.458248991, 
26.14106339, 25.04179051, 25.88123946, 25.37747955), Microm_250 = c(42.63011079, 
42.40400703, 41.42517791, 41.94288617, 41.87864039, 41.23829484, 
41.31751454, 41.63495869, 49.13692679, 49.3879579, 50.23445206, 
51.60666791, 51.1727376, 49.11877239, 51.26674279, 52.00409162, 
50.48687475, 50.90777981, 50.23482884, 49.24764135), Microm_500 = c(39.88402392, 
40.77838926, 41.50718101, 40.65764848, 42.15199068, 40.83119895, 
41.73250567, 41.18885416, 35.87934642, 36.00915655, 37.3180857, 
35.62608032, 37.41951083, 36.26828346, 35.6623117, 35.71836133, 
20.18297131, 20.1836753, 17.33407897, 19.10041196), Microm_1000 = c(4.476572917, 
4.49100193, 5.205346918, 5.380119103, 4.335833973, 6.364313631, 
5.180480148, 5.070936398, 4.175650784, 4.082459775, 2.222407984, 
2.350254673, 2.032838283, 4.575949528, 2.305106476, 1.736779196, 
1.98939803, 2.589534343, 5.184880055, 4.737349268)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 19L, 20L, 21L, 22L), class = "data.frame")

我的偽代碼如下所示:

#define function from SO answer
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# using dplyr
out_test <- test %>%
  group_by(Sample_name) %>%
  apply(2, remove_outliers)

# using base R by
out_test2 <- by(test, test$Sample_name, remove_outliers)

如何標記/檢測與同一樣本的平行線顯着不同的行或直接刪除它們?

哦,還有一個額外的問題:從統計的角度來看,8 個樣本是否足以確定異常值? 在我的情況下,這是由於測量失敗導致的極端情況,但沒有其他人會被誤導。

在排除分組變量后使用mutate_at

library(dplyr)
test %>%
     group_by(Sample_name) %>% 
     mutate_at(vars(-group_cols()), remove_outliers)

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM