Can scipy.stats identify and mask obvious outliers?

Question

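With scipy.stats.linregress I am running simple linear regressions on several sets of highly correlated x, y experimental data, and so far I have been visually inspecting each x, y scatter plot for outliers. More generally (i.e. programmatically), is there a way to identify and mask outliers?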

Answer 1

The statsmodels package has what you need. Look at this little code snippet and its output:

# Imports #
import statsmodels.api as smapi
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # a list, so that insert() also works on Python 3
y = [xi*10 for xi in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make graph #
# Note: OLS takes (endog, exog), so this call regresses x on y without an intercept
regression = smapi.OLS(x, y).fit()
figure = smgraphics.regressionplots.plot_fit(regression, 0)
# Find outliers #
# Columns of outlier_test(): studentized residual, unadjusted p, Bonferroni p
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5)
print('Outliers: ', list(outliers))

Example figure 1

Outliers: [(15, 220)]

Edit

Things have changed a bit with newer versions of statsmodels. Here is a new code snippet that shows the same type of outlier detection.

# Imports #
from random import random
import statsmodels.api as smapi
from statsmodels.formula.api import ols
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # a list, so that insert() also works on Python 3
y = [xi*(10+random())+200 for xi in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers #
test = regression.outlier_test()
# test is a DataFrame here; column 2 holds the Bonferroni-corrected p-values
outliers = ((x[i],y[i]) for i,t in enumerate(test.iloc[:, 2]) if t < 0.5)
print('Outliers: ', list(outliers))
# Figure #
figure = smgraphics.regressionplots.plot_fit(regression, 1)
# Add line #
smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])

Example figure 2

Outliers: [(15, 220)]

Answer 2

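scipy.stats has nothing aimed directly at outliers, so this answer offers some links and a plug for statsmodels (which is a statistics complement to scipy.stats).

For identifying outliers: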

http://jpktd.blogspot.ca/2012/01/influence-and-outlier-measures-in.html

http://jpktd.blogspot.ca/2012/01/anscombe-and-diagnostic-statistics.html

http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html

Rather than masking, a better approach is to use a robust estimator:

http://statsmodels.sourceforge.net/devel/rlm.html

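There are examples, although unfortunately the plots are currently not displayed: http://statsmodels.sourceforge.net/devel/examples/generated/tut_ols_rlm.html

RLM downweights outliers. The estimation results have a weights attribute, and for outliers the weights are smaller than 1. This can also be used to find outliers. RLM is also more robust when there are several outliers.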

Answer 3

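More generally (i.e. programmatically), is there a way to identify and mask outliers?

Various outlier detection algorithms exist; scikit-learn implements some of them.

[Disclaimer: I'm a scikit-learn contributor.]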
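The answer does not name a specific algorithm, so the following is only one possible sketch. It uses scikit-learn's RANSACRegressor, which is a robust regressor rather than a general anomaly detector, because it fits the regression setting in the question and exposes an inlier mask. The noise-free data, residual_threshold=20.0, and random_state=0 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic, noise-free data (an assumption for illustration) plus one planted outlier.
x = np.arange(30, dtype=float)
y = 10.0 * x + 200.0
x = np.insert(x, 6, 15.0)
y = np.insert(y, 6, 220.0)

# residual_threshold is in units of y; 20.0 is an arbitrary choice for this sketch.
ransac = RANSACRegressor(LinearRegression(), residual_threshold=20.0, random_state=0)
ransac.fit(x.reshape(-1, 1), y)

# inlier_mask_ is True for points consistent with the consensus model.
inliers = ransac.inlier_mask_
print("Outliers:", list(zip(x[~inliers], y[~inliers])))
print("Slope:", ransac.estimator_.coef_[0], "Intercept:", ransac.estimator_.intercept_)
```

The mask can then be applied before calling scipy.stats.linregress, e.g. linregress(x[inliers], y[inliers]).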

Answer 4

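You can also limit the effect of outliers with scipy.optimize.least_squares. In particular, take a look at the f_scale parameter:

Value of soft margin between inlier and outlier residuals, default is 1.0. ... This parameter has no effect with loss='linear', but for other loss values it is of crucial importance.

On that page they compare three different fits: a normal least_squares call, and two that involve f_scale: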

res_lsq =     least_squares(fun, x0, args=(t_train, y_train))
res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))
res_log =     least_squares(fun, x0, loss='cauchy', f_scale=0.1, args=(t_train, y_train))

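As can be seen, normal least squares is affected far more by the outliers in the data, and it can be worth playing around with different loss functions in combination with different f_scale values. The possible loss functions (taken from the documentation):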

‘linear’ : Gives a standard least-squares problem.
‘soft_l1’: The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.
‘huber’  : Works similarly to ‘soft_l1’.
‘cauchy’ : Severely weakens outliers influence, but may cause difficulties in optimization process.
‘arctan’ : Limits a maximum loss on a single residual, has properties similar to ‘cauchy’.

The scipy cookbook has a neat tutorial on robust nonlinear regression.
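The least_squares calls quoted above rely on fun, x0, t_train and y_train, which are defined on the documentation page. Here is a self-contained sketch for a straight-line fit instead. The residuals function, the synthetic data, f_scale=1.0, and the cutoff of 10 used to flag points are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic, noise-free data (an assumption for illustration) plus one planted outlier.
x = np.arange(30, dtype=float)
y = 10.0 * x + 200.0
x = np.insert(x, 6, 15.0)
y = np.insert(y, 6, 220.0)

def residuals(params, x, y):
    """Residuals of the straight-line model slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept - y

x0 = np.array([1.0, 0.0])  # rough initial guess
res_lsq = least_squares(residuals, x0, args=(x, y))
res_soft_l1 = least_squares(residuals, x0, loss="soft_l1", f_scale=1.0, args=(x, y))

print("Plain least squares (slope, intercept):", res_lsq.x)
print("soft_l1 loss        (slope, intercept):", res_soft_l1.x)

# Flag points that the robust fit leaves far from the line.
# The cutoff of 10 (in units of y) is arbitrary and only suits this toy data.
final_resid = residuals(res_soft_l1.x, x, y)
mask = np.abs(final_resid) > 10.0
print("Flagged points:", list(zip(x[mask], y[mask])))
```

On this data the true intercept is 200, and the robust fit should land noticeably closer to it than the plain fit, which gets pulled down by the planted point.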

4 answers

Answer 1: 26, accepted, 2013-04-23 09:43:34
Answer 2: 7, 2012-04-20 02:49:46
Answer 3: 6, 2012-04-19 15:46:58
Answer 4: 0, 2017-05-04 14:57:13
