Recoding categorical variables in Python

Question

I've been trying to learn Python 3.6 using the Anaconda distribution. I've hit a snag with the content of the online course I'm using, and could use some help working through some error messages. I'd ask the instructors of the course, but they don't seem very responsive to questions from students.

I've been having some trouble working with the three dominant classes used to recode categorical data. As I understand it, there are three classes from the scikit-learn package used for recoding variables: LabelEncoder, OneHotEncoder and LabelBinarizer. I have attempted to use each one to recode a categorical variable inside a dataset, but keep getting errors with each.

Please pardon my relative noobness in the sample code. As one might have guessed from the basic nature of my question, I am not well versed in Python.

The object X contains a few columns, the first being a categorical string I need to convert (If someone could also tell me how to insert tables, that'd be helpful. Do I have to use HTML?):

"Fish" 1 5 3
"Dog" 2 6 9
"Dog" 8 8 8
"Cat" 5 7 6
"Cat" 6 6 6

Label Encoder Attempt

Below is the code I attempted to implement, and the resulting error message I received for the object X, which has roughly the properties I described above.

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

TypeError: fit_transform() missing 1 required positional argument: 'y'

What is throwing me is that I thought the above code was clearly defining what y is: the first column of X.

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

TypeError: 'method' object is not subscriptable

Label Binarizer

I've found this one the hardest to understand, and actually couldn't make an attempt based on the structure of the dataset.

Any guidance or suggestions you could provide would be immensely helpful.

Answer 1

Let's take it step by step.

First, load the data you showed into a NumPy array named X:

import numpy as np
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

Now try your code.

1) LabelEncoder

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

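What you are doing wrong here is using the class LabelEncoder itself as if it were an object and calling fit_transform on it. Because the method is called on the class rather than on an instance, Python binds X[:, 0] to self, which leaves the y parameter unfilled. That is why the error says y is missing. Correct it like this: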

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

See the changes in lines 2 and 3 above. First I created an object labelencoder_X of the LabelEncoder class by calling LabelEncoder(), and then used that object to call fit_transform() via labelencoder_X.fit_transform(). This code no longer raises an error, and the new X is:

Output:
array([['2', '1', '5', '3'],
       ['1', '2', '6', '9'],
       ['1', '8', '8', '8'],
       ['0', '5', '7', '6'],
       ['0', '6', '6', '6']], dtype='|S4')

See that the first column has been changed successfully.
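To make the mapping easier to inspect, here is a minimal sketch that encodes the first column on its own. The variable names (animals, encoder, codes, restored) are illustrative and not from the question. It relies on the fitted encoder's classes_ attribute and inverse_transform method. Note that in the answer's output the codes appear as strings ('2', '1', '0') only because they were written back into a string-typed array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The first column of X from the question, kept as its own 1-D array.
animals = np.array(["Fish", "Dog", "Dog", "Cat", "Cat"])

# Create an instance first, then call fit_transform on that instance.
encoder = LabelEncoder()
codes = encoder.fit_transform(animals)

# classes_ lists the labels in sorted order; a label's position is its code.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored)
```

Keeping the codes in a separate integer array, as here, avoids the cast back to strings.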

2) OneHotEncoder

Your code:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

Here you are not making the mistake you made with LabelEncoder. You are correctly initializing the object by calling OneHotEncoder(...). But you made a mistake by writing fit_transform[X]. fit_transform is a method and must be called with round parentheses, like this: fit_transform().

See this question for more details about the error.

The correct code should be:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform(X).toarray()

Output: 
array([[0., 0., 1., 1., 5., 3.],
       [0., 1., 0., 2., 6., 9.],
       [0., 1., 0., 8., 8., 8.],
       [1., 0., 0., 5., 7., 6.],
       [1., 0., 0., 6., 6., 6.]])

Note: The above code should be called on an X that has already been transformed with LabelEncoder. If you use it on the original X, it will still throw an error.
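A note for readers on newer scikit-learn releases: the categorical_features argument was deprecated in version 0.20 and later removed, so the OneHotEncoder(categorical_features=[0]) call will not work there. The following is a minimal sketch of one possible equivalent, assuming a scikit-learn version that provides ColumnTransformer (0.20 or later). In those versions OneHotEncoder accepts string categories directly, so the LabelEncoder step is not needed. The names column_transformer and X_encoded, the transformer label "animal", and the use of dtype=object are my own choices and do not come from the original answer.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumption: dtype=object keeps the numbers as numbers instead of
# letting NumPy cast every cell to a string.
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
column_transformer = ColumnTransformer(
    transformers=[("animal", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)

# The stacked result has object dtype, so convert it to float at the end.
X_encoded = column_transformer.fit_transform(X).astype(float)
print(X_encoded)
```

The encoded columns come first, followed by the passthrough columns, so the result should have the same layout as the output shown above.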

3) LabelBinarizer

This is not really different from LabelEncoder, except that it also one-hot encodes the supplied column.

from sklearn.preprocessing import LabelBinarizer
labelencoder_X =LabelBinarizer()
new_binarized_val = labelencoder_X.fit_transform(X[:, 0])

Output:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]])

Note: I ran the LabelBinarizer code on the original X from your question, not the already encoded one. The output shows only the binarized form of the first column.
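Since LabelBinarizer returns only the binarized column, it may help to see one way of joining it back to the other columns. This is a sketch under the assumption that the remaining columns hold integers stored as strings, as in the array above. The names binarizer, animal_columns, numeric_columns and X_combined are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Same string-typed array as in the answer.
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

# Binarize the first column: one 0/1 column per distinct label.
binarizer = LabelBinarizer()
animal_columns = binarizer.fit_transform(X[:, 0])

# Convert the remaining string cells back to integers.
numeric_columns = X[:, 1:].astype(int)

# Place the binarized columns in front of the numeric ones.
X_combined = np.hstack([animal_columns, numeric_columns])
print(binarizer.classes_)
print(X_combined)
```

With only two distinct labels, LabelBinarizer returns a single 0/1 column rather than two, so the number of columns in the combined array depends on the data.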

Hope this makes things clear.

Solution 1
3, accepted, 2018-04-06 04:48:47
