
SkLearn - Why LabelEncoder().fit only to training data

I may be missing something, but after following for quite a long time the suggestion (of some senior data scientists) to LabelEncoder().fit only to training data and not also to test data, I have started to wonder why this is really necessary.

Specifically, with SkLearn, if I want to LabelEncoder().fit only to training data, there are two different scenarios:

  1. The test set has some new labels relative to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (eg Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set with this LabelEncoder(), precisely because it encounters a new label.

  2. The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, then calling LabelEncoder().fit only on the training data is essentially redundant, since the test set has the same known values as the training set.
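Scenario (1) is easy to reproduce. A minimal sketch (the label values are the ones from the question; any recent scikit-learn behaves this way):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["USA", "UK"])  # training set is missing 'France'

try:
    # test set contains a label the encoder has never seen
    le.transform(["USA", "UK", "France"])
except ValueError as e:
    print("ValueError:", e)  # y contains previously unseen labels
```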

Hence, what is the point of LabelEncoder().fit only to training data and then LabelEncoder().transform both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?

Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen LabelEncoder().fit only to training data justified this by saying that the test set should be entirely new even to the simplest model, like an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about production or out-of-vocabulary purposes.

The main reason to do so is that at inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).

In scenario 2, where you are guaranteed to always have the same labels across folds and in production, it is indeed redundant. But are you still guaranteed to see the same labels in production?

In scenario 1 you need to find a solution to handle unknown labels. One popular approach is to map every unknown label to an unknown token. In natural language processing this is called the "out of vocabulary" problem, and the above approach is often used.

To do so and still use LabelEncoder(), you can pre-process your data and perform the mapping yourself.
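A sketch of that pre-processing step. The "&lt;unknown&gt;" token name is an arbitrary choice for illustration; the idea is to reserve one class for it at fit time and collapse any unseen test label onto it before calling transform:

```python
from sklearn.preprocessing import LabelEncoder

train = ["USA", "UK", "France"]
test = ["USA", "Germany"]  # 'Germany' was never seen during training

le = LabelEncoder()
le.fit(train + ["<unknown>"])  # reserve a slot for unseen labels

# replace any label the encoder does not know with the unknown token
known = set(le.classes_)
test_mapped = [x if x in known else "<unknown>" for x in test]
encoded = le.transform(test_mapped)
print(encoded)
```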

It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.

If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this, so the error message is useful information.

Random sampling can often miss rare labels, so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not learn from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.
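One such randomisation method is a stratified split, which scikit-learn supports directly via the stratify parameter of train_test_split. A small sketch with made-up data, showing that the rare 'France' label survives in the training set:

```python
from sklearn.model_selection import train_test_split

# toy dataset where 'France' is a minority label
labels = ["USA"] * 8 + ["UK"] * 8 + ["France"] * 4
X = list(range(len(labels)))

# stratify keeps the label proportions the same in both splits,
# so the rare label cannot vanish from the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)
print(sorted(set(y_train)))  # all three labels are present
```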

Once you have this information, the best approach will depend on your data and the problem to be solved, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness. If your training data doesn't include the label indicating that the illness is present, then you had better re-sample.
