R - Replace and delete the first and last percentiles across a data frame or multiple columns at once

Question

I have this dataset:

A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100) 
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)

As part of treating outliers, I want to delete the rows that contain values in the 1st or the last (100th) percentile. The percentiles should be computed from the X variables only, treating X1 to X5 as ONE group. These are the steps that occur to me:

Replace the values of X1 to X5 with 1 to 100 (one number per percentile). Remember, I'm not looking for the percentiles of each X separately, but for all X's as a whole.
Delete the rows where any of the variables X1 to X5 contains 1 or 100.
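The two steps above can be sketched in base R as follows. This is only an illustration under assumptions: the names x_cols, pooled, breaks, bins, keep and trimmed are made up for the example, the seed is added purely for reproducibility, and the percentile bins come from quantile() with its default settings.

```r
# Toy data shaped like the question's DF (seed added only for reproducibility)
set.seed(1)
DF <- data.frame(A = paste0("event_", 1:100),
                 some_number = sample.int(1000, size = 100),
                 X1 = 1:100, X2 = 101:200, X3 = 201:300,
                 X4 = 301:400, X5 = 401:500)

# Names of the columns that take part in the percentile computation
x_cols <- grep("^X", names(DF), value = TRUE)

# Pool every X value into one vector, so the percentiles describe the whole group
pooled <- unlist(DF[x_cols], use.names = FALSE)

# 101 break points (0%, 1%, ..., 100%) delimit exactly 100 percentile bins
breaks <- quantile(pooled, probs = seq(0, 1, by = 0.01))

# Step 1: replace each X value with the number of the bin it falls into (1 to 100)
bins <- sapply(DF[x_cols], function(x)
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE))

# Step 2: keep only the rows in which no X value sits in bin 1 or bin 100
keep <- rowSums(bins == 1 | bins == 100) == 0
trimmed <- DF[keep, ]

nrow(trimmed)   # 90 for this toy data
```

Because the row filter is applied to DF itself, A, some_number and any other non-X columns are carried along untouched. Note that cut() stops with an error when two break points coincide, which can happen with heavily tied real data.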

My attempt (based on "how to find percentiles", "replace outliers with the 5th and 95th percentile", and "remove data greater than 95th percentile in data frame"):

as.data.frame(sapply(select(DF, X1:X5), function (x) {
     qx <- quantile(x, probs = c(1:100)/100)
     cut(x, qx, labels = c(1:100))
}))

But my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to assign the result back to the data frame without losing the A and some_number variables (in my real problem there are not two such columns, but nearly 50).
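For context on that error, here is a small base R illustration with made-up values (x, ok and bad are names invented for the example): n break points delimit n - 1 intervals, so cut() accepts one label fewer than there are breaks.

```r
x <- 1:10

# Three break points delimit two intervals, so two labels are accepted
ok <- cut(x, breaks = c(0, 5, 10), labels = c("low", "high"))
table(ok)

# Giving as many labels as break points (three here) is rejected with an error
bad <- try(cut(x, breaks = c(0, 5, 10), labels = c("a", "b", "c")), silent = TRUE)
inherits(bad, "try-error")   # TRUE

# probs = (1:100) / 100 yields 100 break points, which leave room for only 99 labels
length(quantile(x, probs = (1:100) / 100))
```

Starting the probabilities at 0 gives 101 break points and therefore 100 intervals, which is one way to make 100 labels fit.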

Any suggestions?

Answer 1

Using both across and c_across from dplyr, you can also do this.

Steps explained:

c_across is usually used with rowwise(), as it builds a complete copy of the data selected by its inner argument. Here I call it without rowwise(), so instead of one row it yields a copy of the whole selected data, as desired.
Two quantiles of this pooled data are then computed (each is a scalar).
The only job that remains is to check every value in the data against these two quantiles, so I used across directly.
Inside across I wrote a lambda formula, which starts with a tilde (~, also called a twiddle) and whose only argument is the dot (.). A formula such as ~ . is equivalent to function(x) x, and the rest should be clear.

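library(dplyr)

# Set every X value outside the pooled 1st-99th percentile range to NA, then drop those rows
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .)
                     )) %>% na.omit()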

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
.
.
.
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495
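The same replace-with-NA-then-drop logic can also be sketched in plain base R, without dplyr. This is not the answer's code but an equivalent illustration under assumptions: x_cols, lims, masked and result are names invented here, and the seed is added only for reproducibility.

```r
# Toy data shaped like the question's DF (seed added only for reproducibility)
set.seed(1)
DF <- data.frame(A = paste0("event_", 1:100),
                 some_number = sample.int(1000, size = 100),
                 X1 = 1:100, X2 = 101:200, X3 = 201:300,
                 X4 = 301:400, X5 = 401:500)

x_cols <- grep("^X", names(DF), value = TRUE)

# Two scalar limits computed from all X values pooled together
lims <- unname(quantile(unlist(DF[x_cols]), c(0.01, 0.99)))

# Replace every X value outside the limits with NA, column by column
masked <- DF
masked[x_cols] <- lapply(DF[x_cols], function(x)
  ifelse(x < lims[1] | x > lims[2], NA, x))

# Drop every row that now contains an NA
result <- na.omit(masked)
nrow(result)   # 90 for this toy data
```

One caveat with this pattern: na.omit() removes rows with an NA in any column, so rows that already had missing values in unrelated columns would be dropped as well.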

starts_with is a selection helper that works only inside selecting functions such as across or c_across. To avoid the slower rowwise here, we can also filter the rows directly:

library(stringr)  # provides str_detect

DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)

This also gives 90 rows in the output, as desired.

Answer 2

You can try the following:

library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))

DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
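To see why the columns are converted to a matrix before calling quantile(), the following base R sketch compares pooled limits with per-column limits. The names X, pooled and per_col are invented for the illustration.

```r
# Only the X columns of the toy data
X <- data.frame(X1 = 1:100, X2 = 101:200, X3 = 201:300,
                X4 = 301:400, X5 = 401:500)

# A matrix is treated as one long vector, so these limits describe all X values together
pooled <- quantile(as.matrix(X), c(0.01, 0.99))
pooled

# Computing the limits column by column gives a different pair for each column
per_col <- sapply(X, quantile, probs = c(0.01, 0.99))
per_col
```

With the pooled limits, only rows holding the very smallest or very largest values of the whole group fall outside the range. Per-column limits would flag the extremes of every single column instead.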

Answer 1
3, accepted, 2021-06-09 04:37:54

Answer 2
1, 2021-06-09 03:41:31