
Removing outliers from groups using data.table in R

I have a data.table object that contains a group column. I am trying to remove outliers from each group, but I cannot come up with a clean solution. My data.table can be built with a simple script:

library(data.table)

col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)  # random group label from 1 to 4
dt <- data.table(id, group, col1, col2)

I've been trying to split the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?

Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:

dt_filt <- dt[, 
    .SD[
        ((col1 < quantile(col1, probs = 0.95)) & 
         (col2 < quantile(col2, probs = 0.95)))
    ], by = group
]

This splits the data by the group column, computes the 95% quantile of col1 and col2 within each group, and keeps only the rows where both col1 and col2 are strictly below their group's thresholds.
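Since the question asks about the top n% in general, the same idea can be wrapped in a small function with the percentage as a parameter. The following is only a sketch: `remove_top_pct` and its arguments (`cols`, `top_pct`, `by`) are hypothetical names rather than part of data.table, and `set.seed(42)` is added purely so the example is reproducible. It collects row indices with `.I` instead of subsetting `.SD`, which is a common data.table idiom and keeps the original row and column order.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed, only for reproducibility
dt <- data.table(
  id    = 1:30,
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)

# Hypothetical helper: keep rows that lie strictly below the
# (1 - top_pct) quantile of every column in `cols`, where the
# quantile is computed separately within each group.
remove_top_pct <- function(dt, cols, top_pct = 0.05, by = "group") {
  keep <- dt[, {
    ok <- rep(TRUE, .N)
    for (col in cols) {
      x <- .SD[[col]]
      ok <- ok & x < quantile(x, probs = 1 - top_pct)
    }
    .I[ok]  # row numbers in the full table that pass the filter
  }, by = by, .SDcols = cols]$V1
  dt[sort(keep)]  # sort to restore the original row order
}

dt_trimmed <- remove_top_pct(dt, cols = c("col1", "col2"), top_pct = 0.05)
```

Because the comparison is strict, the largest value of each column in every group is always dropped, so a group with a single row disappears entirely. If that is not wanted, changing `<` to `<=` would be one way to keep values equal to the threshold.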
