Removing outliers from groups using data.table in R

Question

I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a clean solution. My data.table can be built with a simple script:

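library(data.table)

set.seed(42)  # fix the seed so the example is reproducible
col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)  # random group label (1 to 4) per row
dt <- data.table(id, group, col1, col2)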

I have tried splitting the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?

Answer 1

Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:

dt_filt <- dt[, 
    .SD[
        ((col1 < quantile(col1, probs = 0.95)) & 
         (col2 < quantile(col2, probs = 0.95)))
    ], by = group
]

This splits the data by the group column, computes the per-group thresholds, and then subsets the data to keep only rows where both col1 and col2 are below their thresholds. To remove a different top n%, set probs to 1 - n/100.
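The same idea can be wrapped in a small function so that the cutoff and the columns are parameters. The sketch below is only one possible way to do it: drop_top_outliers, top_frac, cols and group_col are hypothetical names introduced here, not part of data.table, and the seed is arbitrary. It uses the common .I idiom (collect the row numbers to keep per group, then subset once) instead of subsetting .SD inside each group.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dt <- data.table(
  id    = 1:30,
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)

# Hypothetical helper (not a data.table function): keep the rows whose value
# in every column of `cols` is strictly below the per-group
# (1 - top_frac) quantile of that column.
drop_top_outliers <- function(dt, cols, group_col, top_frac = 0.05) {
  keep_idx <- dt[, {
    # one logical vector per column: TRUE where the value is below the cutoff
    below <- lapply(.SD, function(x) x < quantile(x, probs = 1 - top_frac))
    # a row is kept only if it is below the cutoff in all columns;
    # .I holds the row numbers of this group in the original table
    .I[Reduce(`&`, below)]
  }, by = group_col, .SDcols = cols]$V1
  dt[sort(keep_idx)]  # sort() restores the original row order
}

dt_filt <- drop_top_outliers(dt, cols = c("col1", "col2"), group_col = "group")
```

With top_frac = 0.05 this should keep the same rows as the .SD version above; only the row order differs, since the .SD version returns rows grouped by group. Because the comparison is strict, the largest value of each column in each group is always dropped, so very small groups can lose a large share of their rows.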

Answer 1
6, accepted, 2015-10-21 11:58:14
