Replace outstanding values from the mean by NA

I would like to take the mean of each row of my data and find out how far each value in the row is from that mean, as a percentage of the mean. If the percentage is higher than 50, the value should be replaced with NA.

Here is the data:

df1 <- structure(list(Name = structure(c(18L, 19L, 5L, 13L, 14L, 31L
), .Label = c("AMC Javelin", "Cadillac Fleetwood", "Camaro Z28", 
"Chrysler Imperial", "Datsun 710", "Dodge Challenger", "Duster 360", 
"Ferrari Dino", "Fiat 128", "Fiat X1-9", "Ford Pantera L", "Honda Civic", 
"Hornet 4 Drive", "Hornet Sportabout", "Lincoln Continental", 
"Lotus Europa", "Maserati Bora", "Mazda RX4", "Mazda RX4 Wag", 
"Merc 230", "Merc 240D", "Merc 280", "Merc 280C", "Merc 450SE", 
"Merc 450SL", "Merc 450SLC", "Pontiac Firebird", "Porsche 914-2", 
"Toyota Corolla", "Toyota Corona", "Valiant", "Volvo 142E"), class = "factor"), 
    mpg_1 = c(125, 133, 143, 141, 134, 238), cyl_1 = c(114, 153, 
    112, 136, 128, 155), disp_1 = c(113, 143, 144, 131, 431, 
    331), hp_1 = c(332, 221, 113, 331, 134, 151)), .Names = c("Name", 
"mpg_1", "cyl_1", "disp_1", "hp_1"), row.names = c(NA, 6L), class = "data.frame")

And this is the desired output:

               Name mpg_1 cyl_1 disp_1 hp_1
1         Mazda RX4   125   114    113   NA
2     Mazda RX4 Wag   133   153    143  221
3        Datsun 710   143   112    144  113
4    Hornet 4 Drive   141   136    131   NA
5 Hornet Sportabout   134   128     NA  134
6           Valiant   238   155    331  151

There are two conditions as well.

  1. Only one outstanding value per row may be replaced with NA. With a 50% cutoff it is hard to imagine two values in the same row qualifying, because they would shift the mean completely, but see the second condition.
  2. It would be great if the cutoff percentage were easy to modify. I may go lower than 50%.

Do you have any idea how to do this efficiently? It looks doable with a loop, but maybe there is a more efficient way?

From a statistical point of view, as @Roland mentions in the comments, this is not advised. But if you absolutely have to do it, then:

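# x: numeric values of one row; n: cutoff as a fraction of the row mean (0.5 = 50%)
fun1 <- function(x, n){
  # Index of the first value lying more than n above the mean
  # (values below the mean are never flagged); NA if there is none
  idx <- which((x - mean(x))/mean(x) > n)[1]
  x[idx] <- NA  # leaves x unchanged when idx is NA
  return(x)
}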

df1[-1] <- t(apply(df1[-1], 1, fun1, 0.5))

df1
#               Name mpg_1 cyl_1 disp_1 hp_1
#1         Mazda RX4   125   114    113   NA
#2     Mazda RX4 Wag   133   153    143  221
#3        Datsun 710   143   112    144  113
#4    Hornet 4 Drive   141   136    131   NA
#5 Hornet Sportabout   134   128     NA  134
#6           Valiant   238   155     NA  151
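
Note that `fun1` only flags values above the row mean and takes the first match rather than the most extreme one. If the distance should be measured in both directions, a variation along the following lines might work. This is only a sketch, not part of the original answer: the helper name `mask_row_outlier` and its arguments `cols` and `cutoff` are made up for illustration, and it assumes the selected columns are numeric, contain no missing values, and have non-zero row means. It uses `rowMeans` and `max.col` instead of a row-wise `apply`.

```r
# Hypothetical helper (not from the original answer).
# Sets at most one value per row to NA: the value farthest from the row mean,
# and only if its relative distance from the mean exceeds `cutoff`.
mask_row_outlier <- function(df, cols, cutoff = 0.5) {
  m <- as.matrix(df[cols])
  mu <- rowMeans(m)
  # mu is recycled down the columns, so each row is compared with its own mean
  rel_dev <- abs(m - mu) / abs(mu)
  # Column index of the largest relative deviation in each row
  worst <- max.col(rel_dev, ties.method = "first")
  idx <- cbind(seq_len(nrow(m)), worst)
  # Keep only the rows whose largest deviation is above the cutoff
  hit <- rel_dev[idx] > cutoff
  m[idx[hit, , drop = FALSE]] <- NA
  df[cols] <- m
  df
}

# Same numbers as the data in the question
df1 <- data.frame(
  Name = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout", "Valiant"),
  mpg_1 = c(125, 133, 143, 141, 134, 238),
  cyl_1 = c(114, 153, 112, 136, 128, 155),
  disp_1 = c(113, 143, 144, 131, 431, 331),
  hp_1 = c(332, 221, 113, 331, 134, 151)
)

res <- mask_row_outlier(df1, cols = names(df1)[-1], cutoff = 0.5)
res
```

On this data and with a cutoff of 0.5, the sketch should flag the same four cells as `fun1`, because every value that is more than 50% away from its row mean happens to lie above it. Lowering `cutoff` still removes at most one value per row.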
