
Removing outliers in data for NN, good or bad idea?

Original post 2019-07-30 23:28:19 9 1 python/ neural-network/ outliers

I have some data that contains outliers. However, my data has a direction to it, with trends that I need to consider when looking for outliers. Whether a point is an outlier is not a simple yes-or-no question. The only thing I can say is that the farther a data point is from the trend, the more likely it is to be an outlier that I would not want to include in my data.

Since things like the standard deviation, linear regressions, and the chunk of data I am looking at all depend on context, I know of no static function that determines whether something is an outlier.

I can select good outliers using various techniques, but the problem is that any time you get rid of outliers, you are using the context of the data you are picking the outliers from.

I know that when you prepare data for a NN, it always has to be prepared in exactly the same way; that is, it goes through a set of static processes/functions. The techniques used to select outliers require context, and the context changes, so the function changes. I am not sure whether the differences in how an outlier is selected are enough to throw off the integrity of the model.

If this is true, are there any good static methods to select an outlier?

1 answer

A model-independent way of selecting outliers is based on the distribution of errors. This boils down to:

1. Fit the model with all data points
2. Calculate the residual error for each data point
3. Eliminate outliers based on some threshold
4. Re-fit the model from scratch with the outliers removed
5. (Optionally repeat until a termination condition is met, e.g. no outliers are removed)
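The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: a straight-line least-squares fit stands in for the NN (any model with a fit and a predict step would slot in the same way), the outlier rule is a z-score on the residuals, and the names `fit_line`, `remove_outliers`, `z_threshold` and `max_rounds` are hypothetical, not part of any library.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Stand-in for "fit the model"; assumes at least two distinct x values.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def remove_outliers(xs, ys, z_threshold=3.0, max_rounds=10):
    """Repeat fit -> residuals -> drop until a round removes nothing.

    z_threshold and max_rounds are illustrative defaults, not recommendations.
    """
    xs, ys = list(xs), list(ys)
    for _ in range(max_rounds):
        # Steps 1 and 4: (re-)fit from scratch on the points currently kept.
        slope, intercept = fit_line(xs, ys)
        # Step 2: residual error of every kept point.
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        mu, sigma = mean(residuals), pstdev(residuals)
        if sigma == 0:
            break  # all residuals identical, nothing stands out
        # Step 3: keep points whose residual z-score is within the threshold.
        keep = [abs((r - mu) / sigma) <= z_threshold for r in residuals]
        if all(keep):
            break  # step 5: termination, no outliers removed this round
        xs = [x for x, k in zip(xs, keep) if k]
        ys = [y for y, k in zip(ys, keep) if k]
    return xs, ys, fit_line(xs, ys)


# Usage: a noisy line y = 2x + 1 with one corrupted point at x = 10.
xs = list(range(20))
ys = [2 * x + 1 + (0.1 if x % 2 == 0 else -0.1) for x in xs]
ys[10] = 100.0
clean_xs, clean_ys, (slope, intercept) = remove_outliers(xs, ys)
```

Note that the decision to drop a point depends on the residuals of the current fit, so the set of removed points can change between rounds; that is why the model is re-fitted before the residuals are checked again.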

The elimination threshold is problem- and metric-dependent. One approach to outlier elimination is to compute a z-score on the residual errors (subtract the mean and divide by the standard deviation of the residual errors) and then remove any points whose z-score has an absolute value greater than a defined threshold (the threshold equates to the number of standard deviations from the mean beyond which points are identified as outliers).

https://en.wikipedia.org/wiki/Standard_score
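Written out, the z-score rule looks like this. The symbols are notation chosen here for illustration: $e_i$ is the residual of point $i$, $n$ the number of points, and $\tau$ the chosen threshold (for example 3).

```latex
\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad
s_e = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(e_i - \bar{e}\right)^2}, \qquad
z_i = \frac{e_i - \bar{e}}{s_e}, \qquad
\text{remove point } i \text{ if } |z_i| > \tau
```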

This is a general, model-independent approach that assumes residuals are normally distributed (or at least that outliers can be reasonably identified based on relative error).

If you have other assumptions regarding the distribution of the residuals, you can apply other probabilistic criteria (e.g. fit a distribution to the residual errors, then apply a probabilistic threshold to each point). This is more involved though, and if you don't have any a priori belief about the characteristics of the residual error distribution (other than "large errors are likely outliers"), then the z-score is the way to go.
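As one possible instance of the "fit a distribution, then apply a probabilistic threshold" idea, the sketch below assumes the residuals follow a Laplace (double-exponential) distribution. That choice is purely an assumption for illustration, as are the names `laplace_tail_probabilities`, `flag_outliers` and the cutoff `alpha`. The Laplace maximum-likelihood fit uses the median as location and the mean absolute deviation from the median as scale.

```python
from math import exp
from statistics import median


def laplace_tail_probabilities(residuals):
    """Fit a Laplace distribution by maximum likelihood and return, for each
    residual r, the probability P(|X - mu| >= |r - mu|) under that fit."""
    mu = median(residuals)  # MLE of the location parameter
    b = sum(abs(r - mu) for r in residuals) / len(residuals)  # MLE of the scale
    if b == 0:
        return [1.0 for _ in residuals]  # all residuals identical
    # For a Laplace distribution, P(|X - mu| >= t) = exp(-t / b).
    return [exp(-abs(r - mu) / b) for r in residuals]


def flag_outliers(residuals, alpha=0.01):
    """True for points less probable than alpha under the fitted distribution.

    alpha is an illustrative cutoff, not a recommendation.
    """
    return [p < alpha for p in laplace_tail_probabilities(residuals)]


# Usage: nine small residuals and one large one.
residuals = [0.1, -0.2, 0.05, -0.1, 0.15, -0.05, 0.2, -0.15, 0.0, 5.0]
flags = flag_outliers(residuals)
```

The same pattern applies to any other distribution you believe in: estimate its parameters from the residuals, compute a tail probability per point, and compare it against a cutoff.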

The foregoing discusses how to identify outliers, but doesn't address whether you should. That is an application-dependent question. If outliers are not informative of the behavior you want to model, then they can be removed from training. However, if you want your model to predict average (or other metric-optimizing) behavior inclusive of outliers, then they should be retained.