
Removing outliers from groups using data.table in R

I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a clean solution. My data.table can be built with a simple script:

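library(data.table)

# 30 rows: two normally distributed value columns and a random group label (1 to 4)
col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)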

I've been trying to split the data by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?

Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile within each group:

dt_filt <- dt[, 
    .SD[
        ((col1 < quantile(col1, probs = 0.95)) & 
         (col2 < quantile(col2, probs = 0.95)))
    ], by = group
]

This splits the data by the group column, calculates the 95% quantile of col1 and col2 within each group, and then subsets the data to keep only the rows where both col1 and col2 are below their thresholds.
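To make the "top n%" part of the question adjustable, the same idea can be wrapped in a small function. The following is only a sketch: `drop_top_outliers`, `top_frac`, `cols` and `by_col` are hypothetical names introduced here for illustration, not part of data.table. It collects row indices with `.I` instead of subsetting `.SD`, which is a common data.table idiom, and assumes that "outlier" means a value at or above the within-group `1 - top_frac` quantile in any of the chosen columns.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep only rows whose value
# is strictly below the within-group (1 - top_frac) quantile in every
# column listed in `cols`.
drop_top_outliers <- function(dt, cols, top_frac = 0.05, by_col = "group") {
  keep <- dt[, {
    # One logical vector per column, combined with AND across columns
    ok <- Reduce(`&`, lapply(.SD, function(x) {
      x < quantile(x, probs = 1 - top_frac)
    }))
    .I[ok]  # row numbers in the original table that pass the filter
  }, by = by_col, .SDcols = cols]$V1
  dt[sort(keep)]  # sort() preserves the original row order
}

# Usage with the example data from the question
dt <- data.table(
  id    = seq(1, 30),
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)
dt_filt <- drop_top_outliers(dt, cols = c("col1", "col2"), top_frac = 0.05)
```

Because the comparison is strict, the largest value in each group is always dropped, so a group with a single row disappears entirely. Whether that is acceptable depends on the data.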

