
How to remove outliers from a data frame in R?

I have a data frame with 25 variables, and I want to remove the outliers from it.

I have searched SO and found that people propose custom solutions in different posts.

Is there a standard R function that removes outliers from data?

Here are two functions I found while searching. How good are they, and is there a better, standard solution for this in any R package?

Alternatively, is there a function to which I can pass one column as an argument and which returns the data with outliers removed?

remove_outliers: Link 1

Removing outliers - quick & dirty: Link 2

EDIT

My data frame contains continuous data from two sources, i.e. weather and ground. From weather, the predictors are temperature, humidity, wind, rain and solar radiation. From ground, they are groundwater and soil moisture. I want to find a relation between soil moisture and the other variables. I am analysing the data using different models. Now I want to see the results after removing the outliers from the data.

EDIT I used and edited code from one of the tutorials I referenced above. It works fine when there are some outliers in the data, but it raises an error when there are none. How can I correct this?

Here is the code:

outlier_rem <- Data_combined  # data frame with 25 variables, a few of which have outliers

# removing outliers from the column

outliers <- boxplot(outlier_rem$var1, plot = FALSE)$out
#print(outliers)
#ol<-outlier_rem[which(outlier_rem$var1 %in% outliers),]
ol <- outlier_rem[-which(outlier_rem$var1 %in% outliers), ]

dim(ol)
boxplot(ol)

Here is the error message when ol ends up with 0 rows:

> dim(ol)
[1]  0 25
> boxplot(ol)
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Error in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs) : 
  need finite 'ylim' values
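
The empty result above most likely comes from the negative indexing: when no value is flagged, which() returns integer(0), and indexing rows with -integer(0) selects zero rows instead of all of them. A minimal sketch of one way around it, using a logical mask instead of which(); the toy data frame and the names ol_bad and keep are made up for illustration and are not part of the original code:

```r
# Toy data frame standing in for Data_combined (hypothetical values).
outlier_rem <- data.frame(var1 = c(10, 11, 12, 11, 10, 12, 11),
                          var2 = c(1, 2, 3, 4, 5, 6, 7))

# Values that boxplot() flags as outliers; none are flagged for this column.
outliers <- boxplot(outlier_rem$var1, plot = FALSE)$out

# Original form: which() returns integer(0), and negating it still gives
# integer(0), so the row index selects nothing.
ol_bad <- outlier_rem[-which(outlier_rem$var1 %in% outliers), ]

# Logical mask: TRUE marks rows to keep, so "no outliers" keeps every row.
keep <- !(outlier_rem$var1 %in% outliers)
ol <- outlier_rem[keep, ]

dim(ol_bad)  # zero rows
dim(ol)      # all rows retained
```

The same mask also works when outliers are present, so the two cases need no separate handling.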

I use Chebyshev's inequality as a criterion for dropping extreme values. Its advantage is that it holds for any probability distribution with a finite mean and variance, not only the normal distribution. The rule states that no more than 1/k^2 of the values can lie more than k standard deviations away from the mean, so at least 1 - 1/k^2 of them lie within that range. For example:

> x <- rchisq(1000, 13)
> 
> mean(x)
[1] 12.83906
> sd(x)
[1] 4.93234
> 
> Ndesv <- 5
> 
> x[x > (mean(x) + Ndesv * sd(x))]
[1] 38.7575
> 
> Conf <- (1 - 1 / Ndesv^2)
> print(Conf)
[1] 0.96
> 
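
The question also asks for a function that takes a column and returns the data with outliers removed. The criterion above could be wrapped roughly as follows. This is only a sketch under stated assumptions: the function name remove_outliers_sd, its arguments, and the test data are hypothetical, and it applies the rule on both sides of the mean, whereas the console example above checks only the upper tail.

```r
# Hypothetical helper: keep rows whose value in `col` lies within
# k standard deviations of the column mean (two-sided rule).
remove_outliers_sd <- function(df, col, k = 5) {
  x <- df[[col]]
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # Logical mask; rows with NA in `col` are kept rather than dropped.
  keep <- is.na(x) | abs(x - m) <= k * s
  df[keep, , drop = FALSE]
}

set.seed(1)  # arbitrary seed, only for reproducibility
# 999 chi-squared draws plus one planted extreme value (200).
df <- data.frame(var1 = c(rchisq(999, 13), 200), var2 = seq_len(1000))

cleaned <- remove_outliers_sd(df, "var1", k = 5)
nrow(df) - nrow(cleaned)  # number of rows dropped
```

Because the mask is logical, a column without extreme values simply returns every row. Note that the mean and standard deviation are themselves pulled towards extreme values, so a very large outlier can widen the band it is judged against.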

Hope it helps you.
