
Should binary features be one-hot encoded?

I'm working with data that consists of a few dozen binary features about people, which basically come down to "person has feature x" (True/False).

From what I can find online, categorical data should be one-hot encoded instead of assigning an arbitrary value to each category, because you can't say "category 1 is less than category 2". So the solution is to create a dummy variable for each category:

Cat || dummy 1 | dummy 2 | dummy 3
____||_________|_________|________
 1  ||   1     |   0     |   0
 2  ||   0     |   1     |   0
 3  ||   0     |   0     |   1
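
For reference, a minimal sketch of this dummy encoding in Python with pandas (the column name and values are taken from the table above):

    import pandas as pd

    # The categorical column from the table above
    df = pd.DataFrame({"Cat": [1, 2, 3]})

    # One dummy column per category
    dummies = pd.get_dummies(df["Cat"], prefix="dummy", dtype=int)
    print(dummies)
    #    dummy_1  dummy_2  dummy_3
    # 0        1        0        0
    # 1        0        1        0
    # 2        0        0        1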

Now for binary features, one can choose between using the variable directly (1 for true, 0 for false) or using two dummy variables ((1, 0) for true, (0, 1) for false). But I can't find any sources that show or explain what the best approach is.

I myself am conflicted, because on one hand the dummy variables reduce the importance of each individual variable, and it has been shown that in at least some cases the accuracy of the model suffers (source). But on the other hand, two dummy variables can also encode missing data (in the form of (0, 0)). Furthermore, is it even possible to say "False is less than True"?
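
As a side note, this (0, 0) encoding of missing values is what pandas' get_dummies produces by default, since NaN entries simply get a 0 in every dummy column; a minimal sketch:

    import numpy as np
    import pandas as pd

    s = pd.Series([True, False, np.nan])
    print(pd.get_dummies(s, prefix="x", dtype=int))
    #    x_False  x_True
    # 0        0       1
    # 1        1       0
    # 2        0       0   <-- missing value encoded as (0, 0)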

I'm actually using a Random Forest in Python, and I know that tree-based classifiers such as Random Forests support categorical data, but the sklearn package hasn't implemented this yet.

I wrote a small test on the sklearn digits dataset. This dataset contains 8 by 8 images of digits (0-9); each pixel has a value between 0 and 16, and a simple model can use these to learn to recognize the digits.

For my test I changed values > 8 to True and values <= 8 to False. The accuracy of course suffers compared to the original data, but when I apply one-hot encoding, changing True to (1, 0) and False to (0, 1), I can't find a significant difference compared to the plain binary encoding.
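
A rough sketch of such a test, for anyone who wants to reproduce the comparison (the model settings and cross-validation setup here are assumptions, not necessarily what was used originally):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)

    # Binarize: pixel > 8 -> True (1), pixel <= 8 -> False (0)
    X_bin = (X > 8).astype(int)

    # One-hot: each binary pixel becomes two complementary columns (x, 1 - x)
    X_onehot = np.hstack([X_bin, 1 - X_bin])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print("binary :", cross_val_score(clf, X_bin, y, cv=5).mean())
    print("one-hot:", cross_val_score(clf, X_onehot, y, cv=5).mean())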

An explanation of the recommended approach would be greatly appreciated!

Converting a binary variable that takes the values [0, 1] into a one-hot encoding of [(0, 1), (1, 0)] is redundant and not recommended, for the following reasons (some of them are already mentioned in the comments above, but to expand on this):

  1. It is redundant, because the binary variable is already in a form equivalent to the one-hot encoding with the last column dropped. Dropping that column makes no difference, since it can be inferred from the first: if I give you [(0, ), (1, )], you can reconstruct the complementary column [(, 1), (, 0)].

  2. Suppose you have more than one binary variable, say 4. If you convert them into one-hot encoded form, the dimension increases from 4 to 8, which is not recommended for the following reasons:

    • The Curse of Dimensionality: high-dimensional data can be troublesome, because many algorithms (e.g. clustering algorithms) use the Euclidean distance, which, due to its squared terms, is sensitive to noise. Data points spread ever thinner as the number of dimensions increases, the concept of a neighborhood becomes meaningless, and approaches based on the relative contrast between distances of data points become unreliable.

    • Time & Memory Complexity: intuitively, increasing the number of features costs the algorithm more execution time and memory. To name a few examples, algorithms that use the covariance matrix in their computation are affected, and polynomial algorithms end up with too many terms. In general, learning is faster with fewer features, especially when the extra features are redundant.

    • Multi-Collinearity: since the last column in the one-hot encoded form of a binary variable is redundant and 100% correlated with the first column, it causes trouble for linear-regression-based algorithms. For example, ordinary least squares estimation involves inverting a matrix, and when many features are correlated, a computer algorithm may fail to obtain a numerically accurate approximate inverse. Also, linear models work by observing the change in the dependent variable y for a unit change in one independent variable while holding all other independent variables constant; when the independent variables are highly correlated, that interpretation breaks down (multi-collinearity has further consequences as well, although some algorithms, such as decision trees, are less sensitive to it). A numeric sketch of this perfect correlation follows this list.

    • Overfitting-prone: in general, too many features (whether correlated or not) may cause your model to overfit and fail to generalize to new examples, as every data point in your dataset can be fully identified by the given features (see Andrew Ng's lectures, where he explains this in detail).
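
To make the multi-collinearity point concrete, a minimal sketch (assuming scikit-learn 1.2+, where the dense-output parameter is named sparse_output): the two dummy columns of a binary variable are perfectly anti-correlated, and OneHotEncoder's drop='if_binary' option exists precisely to keep a single column instead:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    x = np.array([[1], [0], [1], [1], [0]])

    # Full one-hot encoding: two columns, perfectly anti-correlated
    full = OneHotEncoder(sparse_output=False).fit_transform(x)
    print(np.corrcoef(full[:, 0], full[:, 1])[0, 1])  # -1.0

    # drop='if_binary' keeps a single column for binary features
    reduced = OneHotEncoder(drop="if_binary", sparse_output=False).fit_transform(x)
    print(reduced.ravel())  # [1. 0. 1. 1. 0.] -- the original variable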

Summary

In a nutshell, converting a binary variable into a one-hot encoded one is redundant and may lead to needless trouble. Although correlated features do not always worsen your model, they will not always improve it either.
