Removing outliers from groups using data.table in R

Question

I have a data.table object that contains a group column. I am trying to remove outliers from each group, but I cannot come up with a clean solution. My data.table can be built with this simple script:

library(data.table)

col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)

I've tried splitting the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?

Answer 1

Assuming that you want to remove outliers according to both col1 and col2, using each group's 95% quantile as the threshold:

dt_filt <- dt[, 
    .SD[
        ((col1 < quantile(col1, probs = 0.95)) & 
         (col2 < quantile(col2, probs = 0.95)))
    ], by = group
]

This splits the data by the group column, calculates each group's 95% quantile for col1 and col2, and then subsets the data to keep only rows where both col1 and col2 are strictly below their thresholds. To remove a different top n%, set probs to 1 - n/100.
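If the cutoff or the set of columns needs to vary, the same idea can be wrapped in a small function. The following is only a sketch under some assumptions: drop_top and its arguments cols, top and by_cols are hypothetical names, not part of data.table, and the seed is there only to make the example reproducible. It collects row indices with .I instead of subsetting .SD, so the result keeps the original row and column order.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the example is reproducible
dt <- data.table(
  id    = seq(1, 30),
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)

# Hypothetical helper (not a data.table function): keep the rows whose value
# in every column listed in `cols` is strictly below that column's
# within-group quantile at probability 1 - top.
drop_top <- function(dt, cols, top = 0.05, by_cols = "group") {
  idx <- dt[, {
    # One logical vector per column, combined with AND across columns
    below <- lapply(.SD, function(x) x < quantile(x, probs = 1 - top))
    .I[Reduce(`&`, below)]  # .I holds row numbers in the original table
  }, by = by_cols, .SDcols = cols]$V1
  dt[sort(idx)]  # sort() restores the original row order
}

dt_filt <- drop_top(dt, cols = c("col1", "col2"), top = 0.05)
```

Because the comparison is strict, the largest value of each column in every group is always dropped, and a group with a single row is removed entirely, which matches the behaviour of the .SD version above.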

solution1
6 ACCEPTED 2015-10-21 11:58:14
