R - replace and delete first and last percentile in dataframe or multiple columns at once

I have this dataset:

A <- paste0("event_", c(1:100))
some_number <- sample.int(1000,size=100) 
X1 <- c(1:100)
X2 <- c(101:200)
X3 <- c(201:300)
X4 <- c(301:400)
X5 <- c(401:500)
DF <- data.frame(A, some_number, X1, X2, X3, X4, X5)

As I'm treating outliers, I want to delete the rows that contain values in the 1st and the last percentile, considering only the X variables for the percentile computation and treating all X variables as ONE group. Hence, the percentiles are computed over X1 to X5 as ONE group. These are the steps that occur to me:

  1. Replace the values of X1 to X5 with 1 to 100 (one number per percentile). Remember, I'm not looking for the percentiles of each X, but for all X's as a whole.
  2. Delete the rows where any of the variables X1 to X5 contains 1 or 100.
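The two steps above can be sketched in base R as follows. This is only one possible reading of the steps, not code from the thread: the names x_cols, pooled, breaks, bins and DF_trimmed are made up for illustration, and the seed is an assumption added so the random some_number column is reproducible.

```r
# Rebuild the question's data (the seed is an assumption, for reproducibility only).
set.seed(1)
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

x_cols <- grep("^X", names(DF), value = TRUE)    # columns treated as one group
pooled <- unlist(DF[x_cols], use.names = FALSE)  # all X values as a single vector

# 101 break points (0%, 1%, ..., 100%) define exactly 100 percentile bins.
breaks <- quantile(pooled, probs = (0:100) / 100)

# Step 1: replace each X value with its pooled percentile bin (1..100).
# Only the X columns are overwritten, so A and some_number are kept.
bins <- DF
bins[x_cols] <- lapply(DF[x_cols], function(x) {
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
})

# Step 2: drop rows where any X column falls in bin 1 or bin 100.
keep <- rowSums(bins[x_cols] == 1 | bins[x_cols] == 100) == 0
DF_trimmed <- DF[keep, ]
nrow(DF_trimmed)
```

Because only the columns named in x_cols are assigned to, the same pattern should carry over to a data frame with many more non-X columns.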

My attempt (based on "how to find percentiles", "replace outliers with the 5th and 95th percentile", and "remove data greater than 95th percentile in data frame"):

as.data.frame(sapply(select(DF, X1:X5), function (x) {
     qx <- quantile(x, probs = c(1:100)/100)
     cut(x, qx, labels = c(1:100))
}))

But my attempt raises an error saying that the number of breaks differs from the number of labels. I'm also struggling to assign the new dataframe without losing the A and some_number variables (in my real problem they are not two columns, but nearly 50).
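A likely cause of that error, shown here as a small sketch rather than a definitive diagnosis: cut() turns k break points into k - 1 intervals, so 100 quantiles can carry only 99 labels. Adding the 0% quantile gives 101 breaks and therefore 100 intervals. The names qx_fixed, res and binned are illustrative. Note also that the attempt computes quantiles per column, whereas the question asks for quantiles over all X columns pooled.

```r
x <- 1:100

# As in the attempt: 100 break points, which define only 99 intervals.
qx <- quantile(x, probs = (1:100) / 100)
length(qx) - 1   # number of intervals cut() can label

# cut() stops with an error because 100 labels are supplied for 99 intervals.
res <- tryCatch(cut(x, qx, labels = 1:100), error = function(e) "error")

# Including the 0% quantile gives 101 breaks, i.e. 100 intervals, so 100 labels fit.
# include.lowest = TRUE keeps the minimum value inside the first interval.
qx_fixed <- quantile(x, probs = (0:100) / 100)
binned <- cut(x, qx_fixed, labels = 1:100, include.lowest = TRUE)
```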

Any suggestions?

Using both across and c_across from dplyr, you may also do this:

Steps explained -

  • c_across is usually used with rowwise(), where it combines the columns selected by its argument into one vector per row. Here I have used it without rowwise(), so instead of one row's values it returns all values of the selected columns as a single vector, which is what we want.
  • Two quantiles of this pooled data are then computed (each is a scalar).
  • The only job that remains is to check every value in the data against these two values, so I used across directly.
  • Inside across I built a lambda formula, which starts with a tilde (~) and takes . as its only argument. For example, the tilde-style formula ~ . is equivalent to function(x) x; the rest should be clear.
library(dplyr)

# Set X values outside the pooled 1st/99th percentiles to NA, then drop those rows.
DF %>% mutate(across(starts_with('X'), ~ifelse(. > quantile(c_across(starts_with('X')), 0.99) |
                                                 . < quantile(c_across(starts_with('X')), 0.01),
                                               NA, .)
                     )) %>% na.omit()

#>           A some_number X1  X2  X3  X4  X5
#> 6   event_6          69  6 106 206 306 406
#> 7   event_7         871  7 107 207 307 407
#> 8   event_8         356  8 108 208 308 408
#> ...
#> 93 event_93         432 93 193 293 393 493
#> 94 event_94         967 94 194 294 394 494
#> 95 event_95         516 95 195 295 395 495
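To see why rows 6 to 95 survive, it may help to work out the two pooled cutoffs by hand. This sketch relies on the fact that X1 to X5 together cover the integers 1 to 500 exactly once, and on R's default quantile type (7); the names pooled, cutoffs and by_hand are illustrative.

```r
pooled <- 1:500   # X1..X5 together cover the integers 1 to 500

# Pooled 1st and 99th percentiles, as used in the filter.
cutoffs <- quantile(pooled, probs = c(0.01, 0.99))

# With the default type 7, the cutoff for probability p on this data
# is 1 + p * (500 - 1), i.e. 5.99 and 495.01.
by_hand <- 1 + c(0.01, 0.99) * (500 - 1)

# Values below 5.99 are 1..5 (X1 in rows 1 to 5);
# values above 495.01 are 496..500 (X5 in rows 96 to 100).
sum(pooled < cutoffs[1])
sum(pooled > cutoffs[2])
```

Five rows are dropped at each end, which leaves the 90 rows shown above.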

Since starts_with() only works inside a selecting function such as across() or c_across(), and to avoid a slower rowwise() here, we can also filter directly:

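library(stringr)  # for str_detect()
# Keep rows where no X value lies above the pooled 99th or below the pooled 1st percentile.
DF %>% filter(rowSums(cur_data()[str_detect(names(DF), 'X')] > quantile(c_across(starts_with('X')), 0.99)) == 0 &
                rowSums(cur_data()[str_detect(names(DF), 'X')] < quantile(c_across(starts_with('X')), 0.01)) == 0)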

This also gives 90 rows in the output, as desired.

You can try the following -

library(dplyr)
vec <- DF %>% select(starts_with('X')) %>% as.matrix() %>% quantile(c(0.01, 0.99))

DF %>% filter(if_all(starts_with('X'), ~. > vec[1] & . < vec[2]))
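For comparison, the same filter can be sketched without dplyr. This is an illustrative base R equivalent, not part of the answer: X, inside and DF_base are made-up names, and the data frame is rebuilt so the snippet runs on its own.

```r
DF <- data.frame(
  A = paste0("event_", 1:100),
  some_number = sample.int(1000, size = 100),
  X1 = 1:100, X2 = 101:200, X3 = 201:300, X4 = 301:400, X5 = 401:500
)

X <- as.matrix(DF[startsWith(names(DF), "X")])  # the X columns as one numeric matrix
vec <- quantile(X, c(0.01, 0.99))               # pooled 1st and 99th percentiles

# Keep a row only if every X value lies strictly between the two cutoffs.
inside <- X > vec[1] & X < vec[2]
DF_base <- DF[rowSums(!inside) == 0, ]
nrow(DF_base)   # 90 rows on the question's data
```

The strict comparisons mirror the if_all() call above; a value exactly equal to a cutoff would be dropped here, which makes no difference on this data.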
