R - replace and delete first and last percentile in dataframe or multiple columns at once
I have this dataset:
A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100)
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)
As I'm treating outliers, I want to delete the rows that contain values in the 1st or the last percentile, considering only the X variables for the percentile computation and treating all X variables as ONE group. Hence, the percentiles are computed over X1 to X5 as ONE group. These are the steps that occur to me:
1. Replace the values of X1 to X5 with 1 to 100 (one number per percentile). Remember, I'm not looking for the percentiles of each X separately, but for all X's as a whole.
2. Delete the rows where X1 to X5 contain 1 or 100.
My attempt (based on "how to find percentiles", "replace outliers with the 5th and 95th percentile", and "remove data greater than 95th percentile in data frame"):
as.data.frame(sapply(select(DF, X1:X5), function (x) {
qx <- quantile(x, probs = c(1:100)/100)
cut(x, qx, labels = c(1:100))
}))
But my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to assign the new data frame without losing the A and some_number variables (in my real problem these are not two columns, but nearly 50).
Any suggestions?
Using both across and c_across in dplyr, you may also do this:
Steps explained:
- c_across is usually used with rowwise(), as it creates a complete copy of the data subsetted through its inner argument. Here it is used without rowwise(), so instead of one row it returns a copy of the whole selected data, as desired.
- After that, across is used directly.
- The lambda starts with a twiddle (~) and its only argument is the dot (.).
- ~. is equivalent to function(x) x, and the rest is clear.

library(dplyr)
# Set every X value outside the pooled 1st-99th percentile range to NA, then drop those rows
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .)
)) %>% na.omit()
#> A some_number X1 X2 X3 X4 X5
#> 6 event_6 69 6 106 206 306 406
#> 7 event_7 871 7 107 207 307 407
#> 8 event_8 356 8 108 208 308 408
.
.
.
#> 93 event_93 432 93 193 293 393 493
#> 94 event_94 967 94 194 294 394 494
#> 95 event_95 516 95 195 295 395 495
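To see why exactly rows 6 to 95 survive, the pooled thresholds can be checked by hand. The following is only a base R sketch of that check, not the answer's method; the names x_cols, pooled, bounds, outside and kept are made up for illustration, and the seed is added only so that the random some_number column is reproducible.

```r
# Rebuild the example data from the question
set.seed(1)  # assumption: any seed works, some_number does not affect the result
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

# Pool all X columns into a single vector and take its 1st and 99th percentiles
x_cols <- grep("^X", names(DF), value = TRUE)
pooled <- unlist(DF[x_cols], use.names = FALSE)
bounds <- quantile(pooled, probs = c(0.01, 0.99))  # about 5.99 and 495.01 here

# Flag every cell outside the pooled bounds and keep the rows without any flag
outside <- DF[x_cols] < bounds[[1]] | DF[x_cols] > bounds[[2]]
kept <- DF[rowSums(outside) == 0, ]
nrow(kept)  # 90
```

With these data only X1 can fall below the lower bound (rows 1 to 5) and only X5 can exceed the upper bound (rows 96 to 100), which matches the output shown above.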
Since starts_with is a selection helper that only works inside selecting functions such as across or c_across, and to avoid the slower rowwise here, we can also do this directly:
library(stringr)  # str_detect() comes from stringr; dplyr is assumed to be loaded already
DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)
This also gives the desired 90 rows of output.
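On recent dplyr versions cur_data() is deprecated (since dplyr 1.1.0) in favour of pick(), which accepts selection helpers directly. Here is a possible variant under that assumption. It is a sketch rather than the answer's code: it computes the pooled bounds once up front instead of calling c_across inside filter, and bounds and res are illustrative names.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() is available

# Example data from the question
set.seed(1)  # assumption: only makes some_number reproducible
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

# Pooled 1st and 99th percentiles over all X columns
bounds <- DF %>%
  select(starts_with("X")) %>%
  unlist() %>%
  quantile(c(0.01, 0.99))

# Keep the rows in which no X value lies outside the pooled bounds
res <- DF %>%
  filter(rowSums(pick(starts_with("X")) < bounds[[1]] |
                   pick(starts_with("X")) > bounds[[2]]) == 0)
nrow(res)  # 90
```

This also removes the need for stringr, because pick() does the column selection.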
You can try the following:
library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))
DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
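For completeness, the two steps from the question (replace the values with percentile numbers, then delete rows containing 1 or 100) can also be followed literally. The error in the original attempt comes from cut(): n break points delimit only n - 1 intervals, so 100 breaks cannot take 100 labels. The base R sketch below uses 101 break points computed on the pooled X values; the names x_cols, pooled, breaks, bins, extreme and trimmed are made up for illustration.

```r
# Example data from the question
set.seed(1)  # assumption: only makes some_number reproducible
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

x_cols <- grep("^X", names(DF), value = TRUE)
pooled <- unlist(DF[x_cols], use.names = FALSE)

# 101 break points (0%, 1%, ..., 100%) delimit exactly 100 intervals
breaks <- quantile(pooled, probs = (0:100) / 100)

# Step 1: replace each X value with its percentile bin (1 to 100) over the pooled values
bins <- DF
bins[x_cols] <- lapply(DF[x_cols], function(x) {
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
})

# Step 2: drop the rows in which any X column falls into bin 1 or bin 100
extreme <- rowSums(bins[x_cols] == 1 | bins[x_cols] == 100) > 0
trimmed <- DF[!extreme, ]
nrow(trimmed)  # 90
```

All other columns (A, some_number, or the roughly 50 columns of the real data) are carried along untouched, because only the X columns are overwritten in bins. Note that cut() stops with an error when break points are not unique, which can happen with heavily tied real data; in that case the threshold-based filters above are probably the simpler route.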