Removing outliers from groups using data.table in R
I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a nice solution. My data.table can be built with a simple script:
library(data.table)

col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)
I have tried splitting the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?
Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:
dt_filt <- dt[,
  .SD[
    (col1 < quantile(col1, probs = 0.95)) &
    (col2 < quantile(col2, probs = 0.95))
  ],
  by = group
]
This splits the data by the group column, computes the 95% quantile of col1 and col2 within each group, and then keeps only the rows where both col1 and col2 are below their group's thresholds.
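Since the question asks about the top n% in general, the same idea can be wrapped in a small function. The following is only a sketch under some assumptions: the name remove_top_outliers and its arguments (cols, by_col, p) are made up for illustration, the columns are assumed to be numeric with no NA values, and it uses the .I row-index idiom instead of .SD[...] so that the result keeps the original row order.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep only the rows whose
# value is below the per-group p-quantile in every column listed in `cols`.
# Assumes numeric columns without NA values.
remove_top_outliers <- function(dt, cols, by_col, p = 0.95) {
  idx <- dt[, {
    # One logical vector per column: TRUE where the value is under
    # that column's threshold within the current group.
    ok <- lapply(.SD, function(x) x < quantile(x, probs = p))
    # .I holds the row numbers of the current group in the full table;
    # keep those where all columns pass.
    .I[Reduce(`&`, ok)]
  }, by = by_col, .SDcols = cols]$V1
  # Sort the indices so the output follows the original row order.
  dt[sort(idx)]
}

# Usage with data shaped like the example in the question
set.seed(42)
dt <- data.table(
  id    = seq(1, 30),
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)
dt_filt <- remove_top_outliers(dt, cols = c("col1", "col2"), by_col = "group", p = 0.95)
```

Note that with a strict `<` comparison the largest value of each column in each group is always dropped, so a group with a single row disappears entirely. If that is not wanted, `<=` may be the better choice.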