
Can scipy.stats identify and mask obvious outliers?

With scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x,y experimental data, and initially visually inspecting each x,y scatter plot for outliers. More generally (i.e. programmatically), is there a way to identify and mask outliers?

The statsmodels package has what you need. Look at this little code snippet and its output:

# Imports #
import statsmodels.api as smapi
from statsmodels.graphics import regressionplots
# Make data (lists, so that insert() also works on Python 3) #
x = list(range(30))
y = [v*10 for v in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make graph (OLS takes the dependent variable first) #
regression = smapi.OLS(y, x).fit()
figure = regressionplots.plot_fit(regression, 0)
# Find outliers: column 2 of the test table is the Bonferroni-corrected p-value #
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5)
print('Outliers:', list(outliers))

[Example figure 1]

Outliers: [(15, 220)]

Edit

With the newer version of statsmodels, things have changed a bit. Here is a new code snippet that shows the same type of outlier detection.

# Imports #
from random import random
from statsmodels.formula.api import ols
from statsmodels.graphics import regressionplots
# Make data (lists, so that insert() also works on Python 3) #
x = list(range(30))
y = [v*(10+random())+200 for v in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers: column 2 of the test table is the Bonferroni-corrected p-value #
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test.iloc[:, 2]) if t < 0.5)
print('Outliers:', list(outliers))
# Figure (exog index 1 is x; index 0 is the intercept) #
figure = regressionplots.plot_fit(regression, 1)
# Add line #
regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])

[Example figure 2]

Outliers: [(15, 220)]

scipy.stats doesn't have anything directly for outliers, so as an answer here are some links and some advertising for statsmodels (which is a statistics complement to scipy.stats).

For identifying outliers:

http://jpktd.blogspot.ca/2012/01/influence-and-outlier-measures-in.html

http://jpktd.blogspot.ca/2012/01/anscombe-and-diagnostic-statistics.html

http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html

Instead of masking, a better approach is to use a robust estimator:

http://statsmodels.sourceforge.net/devel/rlm.html

with examples, where unfortunately the plots are currently not displayed: http://statsmodels.sourceforge.net/devel/examples/generated/tut_ols_rlm.html

RLM downweights outliers. The estimation results have a weights attribute, and for outliers the weights are smaller than 1. This can also be used for finding outliers. RLM is also more robust if there are several outliers.

More generally (i.e. programmatically) is there a way to identify and mask outliers?

Various outlier detection algorithms exist; scikit-learn implements a few of them.

[Disclaimer: I'm a scikit-learn contributor.]
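For the regression setting in the question, one scikit-learn tool that yields a mask directly is RANSACRegressor, through its inlier_mask_ attribute. The answer above does not name a specific algorithm, so this is just one possible choice; the residual_threshold value is an assumption that would need tuning to the noise in real data.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Data as in the first snippet: y = 10*x with one outlier (15, 220) at index 6
x = np.arange(30, dtype=float)
y = 10.0 * x
x = np.insert(x, 6, 15.0)
y = np.insert(y, 6, 220.0)

# With no estimator given, RANSAC fits an ordinary LinearRegression on random
# subsets; residual_threshold=5.0 is an assumed tolerance in units of y
ransac = RANSACRegressor(residual_threshold=5.0, random_state=0)
ransac.fit(x.reshape(-1, 1), y)

# inlier_mask_ is True for points consistent with the consensus line
outlier_mask = ~ransac.inlier_mask_
print("outliers:", list(zip(x[outlier_mask], y[outlier_mask])))
print("slope on inliers:", ransac.estimator_.coef_[0])
```

The final estimator is refit on the inliers only, so the mask and the cleaned fit come out of the same call.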

It is also possible to limit the effect of outliers using scipy.optimize.least_squares. In particular, take a look at the f_scale parameter:

Value of soft margin between inlier and outlier residuals, default is 1.0. ... This parameter has no effect with loss='linear', but for other loss values it is of crucial importance.

On that page they compare three different fits: the plain least_squares, and two robust variants that use f_scale:

res_lsq =     least_squares(fun, x0, args=(t_train, y_train))
res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))
res_log =     least_squares(fun, x0, loss='cauchy', f_scale=0.1, args=(t_train, y_train))

[Figure: least-squares comparison]

As can be seen, the normal least squares fit is affected a lot more by outliers in the data, and it can be worth playing around with different loss functions in combination with different f_scale values. The possible loss functions are (taken from the documentation):

‘linear’ : Gives a standard least-squares problem.
‘soft_l1’: The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.
‘huber’  : Works similarly to ‘soft_l1’.
‘cauchy’ : Severely weakens outliers influence, but may cause difficulties in optimization process.
‘arctan’ : Limits a maximum loss on a single residual, has properties similar to ‘cauchy’.
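The comparison lines above rely on fun, x0, t_train and y_train from the documentation example. A self-contained variant for a straight line might look like the following sketch. The data, the seed, and f_scale=1.0 (chosen to match the assumed noise level) are my own assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic data (assumed for illustration): y = 10*t + 200 plus unit noise
rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)
y = 10.0 * t + 200.0 + rng.normal(0.0, 1.0, size=t.size)

# One gross outlier, (15, 220), inserted at position 6
t = np.insert(t, 6, 15.0)
y = np.insert(y, 6, 220.0)


def fun(params, t, y):
    """Residuals of the line params[0] + params[1] * t against y."""
    return params[0] + params[1] * t - y


x0 = np.array([0.0, 1.0])

# Plain least squares (loss='linear' is the default)
res_lsq = least_squares(fun, x0, args=(t, y))

# Robust fit, started from the plain solution; f_scale is the residual size
# beyond which points are treated as outliers
res_soft_l1 = least_squares(fun, res_lsq.x, loss="soft_l1", f_scale=1.0,
                            args=(t, y))

print("plain   (intercept, slope):", res_lsq.x)
print("soft_l1 (intercept, slope):", res_soft_l1.x)
```

Because the outlier sits near the middle of the t range, it mostly drags the intercept of the plain fit rather than the slope; the soft_l1 fit should stay much closer to 200.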

The scipy cookbook has a neat tutorial on robust nonlinear regression.
