
Removing outliers from Excel data using R code

The following data come from an Excel file:

Part  A  B  C  D  E  F   G  H  I  J  K   L
XXX   0  1  1  2  0  1   2  3  1  2  1   0
YYY   0  1  2  2  0  30  1  1  0  1  10  0
....

I want to display the parts that contain outliers, where an outlier is any value falling outside the interval

[median - t * MAD, median + t * MAD]

How can I code this in R as a function that works on a large amount of data?

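You would want to calculate robust Z-scores based on the median and the MAD (median absolute deviation) instead of the non-robust mean and standard deviation. Then assess your data using Z: Z = 0 means the value is at the median, Z = 1 means it is one MAD away, and so on. Note that R's mad() multiplies the raw MAD by 1.4826 by default, so that for normally distributed data it is comparable to the standard deviation; pass constant = 1 if you want the raw MAD.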

Let's assume we have the following data, where one group consists of outliers:

df <- rbind(
  data.frame(tag = 'normal',  res = rnorm(1000) * 2.71),
  data.frame(tag = 'outlier', res = rnorm(20) * 42)
)

Then compute the robust Z-score:

df$z <- with(df, (res - median(res))/mad(res))

That gives us something like this:

> head(df)
     tag    res       z
1 normal -3.097 -1.0532
2 normal -0.650 -0.1890
3 normal  1.200  0.4645
4 normal  1.866  0.6996
5 normal -6.280 -2.1774
6 normal  1.682  0.6346
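
To connect the Z-score back to the interval in the question: flagging values with |Z| > t is the same test as checking whether a value lies outside [median - t * MAD, median + t * MAD]. The following is a minimal sketch of that equivalence. The seed, the cutoff t = 3, and the variable names are assumptions made for illustration; they are not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# Same kind of simulated data as above: a narrow group plus a wide one
res <- c(rnorm(1000) * 2.71, rnorm(20) * 42)

t <- 3  # assumption: the question leaves the cutoff t open

# Robust Z-score; mad() applies the 1.4826 scaling constant by default
z <- (res - median(res)) / mad(res)
by_z <- abs(z) > t

# The same test written as the interval from the question
lower <- median(res) - t * mad(res)
upper <- median(res) + t * mad(res)
by_interval <- res < lower | res > upper

# Both flags mark the same observations
all(by_z == by_interval)
```

A larger t makes the test stricter, so fewer values are flagged.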

Then cut the scores into Z-bands, e.g.:

df$band <- cut(df$z, breaks=c(-99,-3,-1,1,3,99))
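
One caveat with fixed outer breaks: cut() returns NA for any value that lies outside the outermost breaks, so a score beyond ±99 would silently drop out of the table. A small sketch with made-up scores (the vector z below is purely illustrative) shows the difference when the outer breaks are replaced by -Inf and Inf:

```r
z <- c(-150, -2, 0, 2, 150)  # made-up scores, two of them beyond +/-99

band_fixed <- cut(z, breaks = c(-99, -3, -1, 1, 3, 99))
band_open  <- cut(z, breaks = c(-Inf, -3, -1, 1, 3, Inf))

sum(is.na(band_fixed))  # 2: the two extreme scores fall outside every band
sum(is.na(band_open))   # 0: open-ended bands catch them
```

For the simulated data in this answer the ±99 limits are wide enough, so the band column computed above is unaffected.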

That can be analyzed in a straightforward way:

> addmargins(xtabs(~band+tag, df))
          tag
band       normal outlier  Sum
  (-99,-3]      1       9   10
  (-3,-1]     137       0  137
  (-1,1]      719       2  721
  (1,3]       143       1  144
  (3,99]        0       8    8
  Sum        1000      20 1020

As the table shows, the observations with the largest absolute Z (those in the (-99,-3] and (3,99] bands) come almost entirely from the outlier group.
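
For the wide layout in the question (one row per part, one column per measurement), the same idea can be applied row by row. The sketch below assumes the sheet has already been loaded into a data frame, for example with the readxl package, and that every column other than Part is numeric. The function name find_outlier_parts, its arguments, and the default t = 3 are hypothetical choices for illustration, not something given in the question or the answer.

```r
# Hypothetical helper: returns the parts that have at least one value outside
# [median - t * MAD, median + t * MAD], computed separately for each row.
find_outlier_parts <- function(parts, t = 3, id_col = "Part") {
  values <- as.matrix(parts[, setdiff(names(parts), id_col), drop = FALSE])

  med <- apply(values, 1, median, na.rm = TRUE)  # one median per part
  dev <- apply(values, 1, mad, na.rm = TRUE)     # one MAD per part (scaled by 1.4826)

  # A length-nrow vector recycles down the columns, so each row is compared
  # with its own median and MAD
  is_out <- abs(values - med) > t * dev

  flagged <- which(rowSums(is_out, na.rm = TRUE) > 0)
  data.frame(
    Part = parts[[id_col]][flagged],
    outlier_cols = vapply(
      flagged,
      function(i) paste(colnames(values)[which(is_out[i, ])], collapse = ","),
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# The two rows shown in the question
parts <- data.frame(
  Part = c("XXX", "YYY"),
  A = c(0, 0), B = c(1, 1), C = c(1, 2), D = c(2, 2), E = c(0, 0), F = c(1, 30),
  G = c(2, 1), H = c(3, 1), I = c(1, 0), J = c(2, 1), K = c(1, 10), L = c(0, 0)
)

find_outlier_parts(parts)  # one row: Part "YYY", outlier_cols "F,K"
```

Writing the test as a comparison against t * MAD rather than dividing by the MAD avoids a division by zero. Be aware, though, that when more than half of a row's values are identical the MAD is 0, and every value that differs from the median is then flagged.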
