Removing outliers from Excel data using R
The following data sheet comes from an Excel file:
Part A B C D E F G H I J K L
XXX 0 1 1 2 0 1 2 3 1 2 1 0
YYY 0 1 2 2 0 30 1 1 0 1 10 0
....
I want to display the parts that contain outliers, where an outlier is any value falling outside the interval
[median - t * MAD, median + t * MAD]
How can I write this as an R function that works on a large amount of data?
You would want to calculate robust Z-scores based on the median and the MAD (median absolute deviation) instead of the non-robust mean and SD. Then assess your data using Z, with Z = 0 meaning on the median, Z = 1 one MAD away from it, and so on. Note that R's mad() multiplies the raw MAD by 1.4826 by default, so Z is expressed in units of that scaled MAD unless you pass constant = 1.
Let's assume we have the following data, where one set consists of outliers:
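As a minimal sketch of the calculation just described, the robust Z-score could be wrapped in a small helper. The name robust_z and its constant argument are illustrative assumptions, not part of the original answer; constant is simply passed through to mad().

```r
# Hypothetical helper: robust Z-score based on the median and the MAD.
# `constant` is forwarded to mad(); R's default of 1.4826 makes the MAD
# comparable to the SD for normally distributed data.
robust_z <- function(x, constant = 1.4826) {
  (x - median(x, na.rm = TRUE)) / mad(x, constant = constant, na.rm = TRUE)
}

x <- c(1, 2, 3, 4, 100)
robust_z(x, constant = 1)  # Z in units of the raw MAD: -2 -1 0 1 97
robust_z(x)                # Z in units of the scaled MAD (R's default)
```

If the MAD is zero (more than half of the values are identical), the division yields Inf or NaN, so such data would need separate handling.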
df <- rbind(
  data.frame(tag='normal',  res=rnorm(1000)*2.71),
  data.frame(tag='outlier', res=rnorm(20)*42)
)
Then compute the robust Z-score:
df$z <- with(df, (res - median(res))/mad(res))
That gives us something like this:
> head(df)
     tag    res       z
1 normal -3.097 -1.0532
2 normal -0.650 -0.1890
3 normal  1.200  0.4645
4 normal  1.866  0.6996
5 normal -6.280 -2.1774
6 normal  1.682  0.6346
Then cut it into Z-bands, e.g.:
df$band <- cut(df$z, breaks=c(-99,-3,-1,1,3,99))
That can be analyzed in a straightforward way:
> addmargins(xtabs(~band+tag, df))
          tag
band       normal outlier  Sum
  (-99,-3]      1       9   10
  (-3,-1]     137       0  137
  (-1,1]      719       2  721
  (1,3]       143       1  144
  (3,99]        0       8    8
  Sum        1000      20 1020
As the table shows, the observations with the largest |Z| (those in the (-99,-3] and (3,99] Z-bands) come almost entirely from the outlier group: 17 of the 18 values in those two bands are tagged as outliers.
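To connect this back to the layout in the question (one part per row, measurements in columns A to L), here is a hedged sketch that applies the [median - t * MAD, median + t * MAD] rule row by row. The helper name is_mad_outlier, its thresh argument (standing in for t) and the choice thresh = 3 are assumptions for illustration. The two rows are typed in from the question; in practice the sheet would be loaded from the Excel file first.

```r
# Hypothetical helper: TRUE where a value lies outside
# [median - thresh * MAD, median + thresh * MAD].
# mad() uses R's default scaling constant of 1.4826 unless overridden.
is_mad_outlier <- function(x, thresh = 3, constant = 1.4826) {
  m <- median(x, na.rm = TRUE)
  s <- mad(x, constant = constant, na.rm = TRUE)
  abs(x - m) > thresh * s
}

# The two example rows from the question.
parts <- data.frame(
  Part = c("XXX", "YYY"),
  A = c(0, 0), B = c(1, 1), C = c(1, 2), D = c(2, 2),
  E = c(0, 0), F = c(1, 30), G = c(2, 1), H = c(3, 1),
  I = c(1, 0), J = c(2, 1), K = c(1, 10), L = c(0, 0),
  stringsAsFactors = FALSE
)

# Numeric matrix of measurements, one row per part.
values <- as.matrix(parts[, -1])
rownames(values) <- as.character(parts$Part)

# apply() over rows returns one column per part, so transpose back.
flags <- t(apply(values, 1, is_mad_outlier, thresh = 3))

# Parts with at least one flagged value.
has_outlier <- rowSums(flags) > 0
parts$Part[has_outlier]              # "YYY"

# Columns flagged for a given part.
colnames(flags)[flags["YYY", ]]      # "F" "K"
```

For both rows the median and the raw MAD are 1, so with the default scaling the limits are roughly -3.45 and 5.45; only the values 30 and 10 in YYY fall outside them. Because the work is done on a numeric matrix, the same code should carry over to a sheet with many more parts.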