
Detect outlier rows within a group and outliers within a single row

Car    100 200 300
Group1  34  35  34
Group1  57  67  34
Group1  68  76  6
Group2  45  23  23

I am having trouble detecting outliers in my data frame. For each group, I want to detect whether a complete vector (one row) is an outlier relative to the other vectors of the same group (e.g. rows one to three for Group1). I also want to detect whether there is an outlier within one specific row. For the second problem I found the solution below, but with this code I have to repeat everything for every single row and check the output for a "TRUE". Can this be automated, e.g. by building a matrix of all outputs so that I only have to check sum(matrix == TRUE)?

The code:

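library(outliers)  # provides grubbs.test()

# Values of the first row as a plain numeric vector
x <- as.numeric(data_without[1, 1:400])

# Repeatedly apply Grubbs' test and flag every value it reports as an outlier
grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  while (pv < 0.05) {
    # The flagged value is the third word of e.g. "highest value 300 is an outlier"
    outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, " ")[[1]][3]))
    test <- x[!x %in% outliers]  # drop flagged values and test again
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  data.frame(X = x, Outlier = (x %in% outliers))
}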

grubbs.flag(x)
         X Outlier
1   0.1157   FALSE
2   0.1152   FALSE
3   0.1163   FALSE
4   0.1165   FALSE

I've read the package documentation: by default, grubbs.test only checks whether the given data contain a single outlier. I therefore think it suffices to run the test once per group.

First the data are split by group, and then the test is applied to each group in turn. For a group with several rows, all of its values are pooled into one sample. Only the p-value and the description of the alternative hypothesis are returned, which is enough to see which value is the outlier, if any, because it is always either the maximum or the minimum value.

library(outliers)
df <- t(data.frame(car = c(100, 200, 300),
                   g1 = c(34, 35, 34),
                   g1 = c(57, 67, 34),
                   g1 = c(68, 76, 6),
                   g2 = c(45, 23, 23)))
row.names(df) <- c("car", "group1", "group1", "group1", "group2")

lst <- lapply(1:length(unique(row.names(df))), function(x) {
  df[row.names(df)==unique(row.names(df))[x],]
})

lst
[[1]]
[1] 100 200 300

[[2]]
       [,1] [,2] [,3]
group1   34   35   34
group1   57   67   34
group1   68   76    6

[[3]]
[1] 45 23 23
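The same grouping can also be sketched with base R's split(), which keeps the group names on the result. This is only an illustration under the assumption that the data sit in a numeric matrix whose row names are the group labels, as above; the names m, idx and pooled are made up for this example.

```r
# Same layout as the matrix above: one row per vector, row names give the group.
m <- rbind(c(100, 200, 300),
           c(34, 35, 34),
           c(57, 67, 34),
           c(68, 76, 6),
           c(45, 23, 23))
rownames(m) <- c("car", "group1", "group1", "group1", "group2")

# Row indices belonging to each group (groups come out in alphabetical order).
idx <- split(seq_len(nrow(m)), rownames(m))

# Pool all values of each group into one plain numeric vector.
pooled <- lapply(idx, function(i) as.numeric(m[i, ]))

names(pooled)
lengths(pooled)
```

Each element of pooled can then be passed to grubbs.test() in the same way as the elements of lst.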

lapply(lst, function(x) {
  tst <- grubbs.test(x)
  c(tst$p.value, tst$alternative)
})
[[1]]
[1] "0.5"                             "highest value 300 is an outlier"

[[2]]
[1] "0.244875529263511"            "lowest value 6 is an outlier"

[[3]]
[1] "0"                              "highest value 45 is an outlier"
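For the second part of the question (one logical matrix instead of re-running the code for every row), a possible sketch is to wrap the repeated test in a function that returns a logical vector and call it through apply(). The function flag_row, its alpha argument and the matrix m are hypothetical names, and the numbers are made-up illustration data, not the asker's. Unlike grubbs.flag above, this version tracks the position of the most extreme value instead of parsing the result text, on the assumption that the default test always targets the value furthest from the mean.

```r
library(outliers)  # provides grubbs.test()

# Flag Grubbs outliers in one numeric vector; returns a logical vector.
flag_row <- function(x, alpha = 0.05) {
  flagged <- rep(FALSE, length(x))
  repeat {
    keep <- which(!flagged)
    if (length(keep) < 3) break         # too few points left to test
    vals <- x[keep]
    pv <- grubbs.test(vals)$p.value
    if (!isTRUE(pv < alpha)) break      # not significant (or p-value is NA)
    # Mark the remaining value that lies furthest from the mean.
    flagged[keep[which.max(abs(vals - mean(vals)))]] <- TRUE
  }
  flagged
}

# Made-up data: row "a" contains one obvious outlier, row "b" does not.
m <- rbind(a = c(10, 11, 9, 10, 12, 11, 10, 50),
           b = c(10, 11, 9, 10, 12, 11, 10, 9))

flags <- t(apply(m, 1, flag_row))  # logical matrix with the same shape as m
sum(flags)                         # total number of flagged cells
which(flags, arr.ind = TRUE)       # row and column of each flagged cell
```

With the real data this would be called on the numeric part of the data frame, e.g. as.matrix(data_without[, 1:400]) if those are the measurement columns.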
