The following data sheet comes from an Excel file:
Part A B C D E F G H I J K L
XXX 0 1 1 2 0 1 2 3 1 2 1 0
YYY 0 1 2 2 0 30 1 1 0 1 10 0
....
I want to display the parts that contain outliers, where a value counts as an outlier if it falls outside the interval
[median - t * MAD, median + t * MAD]
How can I code this in R as a function that works on a large amount of data?
You would want to calculate robust Z-scores based on the median and MAD (median absolute deviation) instead of the non-robust mean and SD. Then assess your data using Z: Z = 0 means on the median, Z = 1 means one MAD above it, and so on. A value lies outside [median - t * MAD, median + t * MAD] exactly when |Z| > t.
Let's assume we have the following data, where one set is outliers:
df <- rbind(
  data.frame(tag = 'normal',  res = rnorm(1000) * 2.71),
  data.frame(tag = 'outlier', res = rnorm(20) * 42)
)
Then compute the robust Z-score:
df$z <- with(df, (res - median(res))/mad(res))
That gives us something like this:
> head(df)
     tag    res       z
1 normal -3.097 -1.0532
2 normal -0.650 -0.1890
3 normal  1.200  0.4645
4 normal  1.866  0.6996
5 normal -6.280 -2.1774
6 normal  1.682  0.6346
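One detail about the Z computed above: by default R's mad() multiplies the raw median absolute deviation by a constant of 1.4826, so that it estimates the standard deviation for normally distributed data. If the interval in the question is meant in raw MAD units, pass constant = 1. A small sketch on a made-up vector x (not taken from the question) showing the difference:

```r
x <- c(1, 2, 3, 4, 100)             # made-up values with one obvious outlier

raw_mad    <- mad(x, constant = 1)  # median(|x - median(x)|), unscaled
scaled_mad <- mad(x)                # default: 1.4826 * raw MAD

# Robust Z-score, computed the same way as df$z above
z <- (x - median(x)) / scaled_mad
```

Here the median is 3 and the raw MAD is 1, so the last value ends up far more than 3 scaled MADs from the median.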
Then cut it into Z-bands, e.g.:
df$band <- cut(df$z, breaks=c(-99,-3,-1,1,3,99))
That can be analyzed in a straightforward way:
> addmargins(xtabs(~band+tag, df))
          tag
band       normal outlier  Sum
  (-99,-3]      1       9   10
  (-3,-1]     137       0  137
  (-1,1]      719       2  721
  (1,3]       143       1  144
  (3,99]        0       8    8
  Sum        1000      20 1020
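As can be seen, the values with the largest |Z| (those in the (-99,-3] and (3,99] Z-bands) come almost entirely from the outlier group: 17 of the 18 values in those two bands are tagged outlier. The separation is not perfect, though. One normal value lands in an extreme band, and three of the outliers fall within |Z| <= 3.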
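To apply the same idea to the wide table from the question, here is a minimal sketch of a function that returns the parts (rows) having at least one value outside [median - t * MAD, median + t * MAD], with the median and MAD computed per row. The name find_outlier_parts, its arguments, and the default t = 3 are my own assumptions, and it uses the default scaled mad(); treat it as a starting point rather than a definitive implementation.

```r
# Hypothetical helper (name and defaults are assumptions, not from the question):
# keep the rows that contain at least one value outside
# [median - t * MAD, median + t * MAD], computed separately for each row.
find_outlier_parts <- function(data, t = 3, id_col = "Part") {
  values <- as.matrix(data[, setdiff(names(data), id_col), drop = FALSE])
  med <- apply(values, 1, median, na.rm = TRUE)  # per-row median
  dev <- apply(values, 1, mad, na.rm = TRUE)     # per-row MAD (scaled by 1.4826)
  # med and dev have one entry per row, so they recycle down each column
  is_out <- abs(values - med) > t * dev
  data[rowSums(is_out, na.rm = TRUE) > 0, , drop = FALSE]
}

# The two example rows from the question
parts <- read.table(header = TRUE, text = "
Part A B C D E F G H I J K L
XXX 0 1 1 2 0 1 2 3 1 2 1 0
YYY 0 1 2 2 0 30 1 1 0 1 10 0
")

flagged <- find_outlier_parts(parts, t = 3)
```

With t = 3 only YYY is returned, because of the values 30 and 10. Note that if more than half of a row's values are identical, its MAD is 0 and every differing value is flagged, so you may want to handle that case separately. The work is done on a numeric matrix, which should cope with a fairly large table, though I have not benchmarked it.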