How to perform OneHotEncoding in Sklearn, getting value error

Question

I just started learning machine learning. While practicing one of the tasks, I am getting a value error, even though I followed the same steps as the instructor.

I am getting a value error, please help.

dff

     Country    Name
 0     AUS      Sri
 1     USA      Vignesh
 2     IND      Pechi
 3     USA      Raj

First I performed label encoding:

X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])

out:
X
array([[0, 'Sri'],
       [2, 'Vignesh'],
       [1, 'Pechi'],
       [2, 'Raj']], dtype=object)

Then I performed one-hot encoding on the same X:

onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()

I am getting the below error:

ValueError                                Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   1900         """
   1901         return _transform_selected(X, self._fit_transform,
-> 1902                                    self.categorical_features, copy=True)
   1903 
   1904     def _transform(self, X):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1695     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1696     """
-> 1697     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1698 
   1699     if isinstance(selected, six.string_types) and selected == "all":

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'Raj'

Please edit my question if anything is wrong. Thanks in advance!
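
For readers following along, the last frame of the traceback can be reproduced with NumPy alone. This is a minimal sketch, assuming the same mixed array as in the question (the variable names `converted` and `message` are illustrative): the encoder in this scikit-learn version requests a float array, and the untouched Name column cannot be converted.

```python
import numpy as np

# Same layout as X after label encoding: integer codes plus raw name strings
X = np.array([[0, 'Sri'], [2, 'Vignesh'], [1, 'Pechi'], [2, 'Raj']], dtype=object)

try:
    # check_array in the traceback does essentially this conversion
    np.array(X, dtype=np.float64)
    converted = True
except ValueError as exc:
    converted = False
    message = str(exc)
    print(message)
```

Label encoding only the Country column is therefore not enough for this code path: every column handed to the encoder has to be numeric, or the string column has to be kept out of the call.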

Answer 1

You can go directly to OneHotEncoding now without using the LabelEncoder, and as we move toward version 0.22 many might want to do things this way to avoid warnings and potential errors (see DOCS and EXAMPLES).

Example code 1 where ALL columns are encoded and where the categories are explicitly specified:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:,0])
names = np.unique(X[:,1])

ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()

print (X)

Output for code example 1:

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]

Example code 2 showing the 'auto' option for specification of categories:

The first 3 columns encode the country names, the last 4 the personal names.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()

print (X)

Output for code example 2 (same as for 1):

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]
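
To check which output column belongs to which category, the fitted encoder can be inspected. The sketch below assumes scikit-learn 0.20 or later, where `categories_` and `inverse_transform` are available; the names `encoded` and `decoded` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]],
             dtype=object)

ohe = OneHotEncoder(categories='auto')
encoded = ohe.fit_transform(X).toarray()

# One array of categories per input column, listed in output-column order
for col, cats in enumerate(ohe.categories_):
    print(col, list(cats))

# Map the one-hot rows back to the original values
decoded = ohe.inverse_transform(encoded)
print(decoded)
```

The categories are sorted within each input column, which is why AUS, IND, USA occupy the first three output columns and Pechi, Raj, Sri, Vignesh the last four.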

Example code 3 where only the first column is one-hot encoded:

Now, here's the unique part. What if you only need to one-hot encode a specific column of your data?

(Note: I've left the last column as strings for easier illustration. In reality it makes more sense to do this when the last column is already numerical.)

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:,0])
names = np.unique(X[:,1])

ohe = OneHotEncoder(categories=[countries]) # specify ONLY unique country names
tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()

# Append the original Name column (not the sorted unique names) so rows stay aligned
X = np.append(tmp, X[:,1].reshape(-1, 1), axis=1)

print (X)

Output for code example 3:

[[1.0 0.0 0.0 'Sri']
 [0.0 0.0 1.0 'Vignesh']
 [0.0 1.0 0.0 'Pechi']
 [0.0 0.0 1.0 'Raj']]
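
An alternative for encoding a single column, not part of the original answer, is `ColumnTransformer`. This is a sketch assuming scikit-learn 0.20 or later; the step name 'country' and the variable `X_enc` are illustrative, and `sparse_threshold=0` is set so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])

# Encode column 0 only; every other column is passed through unchanged
ct = ColumnTransformer(
    [('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
X_enc = ct.fit_transform(df.values)

print(X_enc)
```

The passthrough column is appended after the encoded columns, so each name stays on the row it came from without any manual reshaping.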

Answer 2

The implementation below should work well. Note that the input to OneHotEncoder's fit_transform must not be a rank-1 array; also, the output is sparse, and we use toarray() to expand it.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]


df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

le = LabelEncoder()
X_num = le.fit_transform(X[:,0]).reshape(-1,1)

ohe = OneHotEncoder()
X_num = ohe.fit_transform(X_num)

print (X_num.toarray())

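# X_num is sparse, so expand it and stack it next to the remaining columns
X = np.hstack([X_num.toarray(), X[:, 1:]])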

print (X)
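
To make the two steps concrete, here is a NumPy-only sketch of what label encoding followed by one-hot encoding computes for the Country column. It is an illustration rather than scikit-learn's implementation; the names `codes` and `one_hot` are made up for this example.

```python
import numpy as np

countries = np.array(["AUS", "USA", "IND", "USA"])

# Step 1 (label encoding): map each value to the index of its sorted category
categories, codes = np.unique(countries, return_inverse=True)

# Step 2 (one-hot encoding): pick the matching row of an identity matrix
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the position given by its code, which is the same layout the encoder produces after `toarray()`.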

Answer 3

An alternative, if you do want to encode multiple categorical features, is to use a Pipeline with a FeatureUnion and a couple of custom transformers.

First you need two transformers: one for selecting a single column and one for making LabelEncoder usable in a Pipeline (its fit_transform method only takes a single argument, whereas a Pipeline passes both X and an optional y).

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class SingleColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, column):
        self.column = column

    def transform(self, X, y=None):
        return X[:, self.column].reshape(-1, 1)

    def fit(self, X, y=None):
        return self

class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # LabelEncoder expects a 1-D input, so flatten the selected column first
        return LabelEncoder().fit_transform(X.ravel()).reshape(-1, 1)

Next create a Pipeline (or just a FeatureUnion) which has 2 branches, one for each of the categorical columns. Within each branch, select 1 column, encode the labels and then one-hot encode.

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

pipeline = Pipeline([(
    'encoded_features',
    FeatureUnion([('countries',
        make_pipeline(
            SingleColumnSelector(0),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        )), 
        ('names', make_pipeline(
            SingleColumnSelector(1),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        ))
    ]))
])

Finally run your full dataframe through the Pipeline; it will one-hot encode each column separately and concatenate the results at the end.

df = pd.DataFrame([["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]], columns=['Country', 'Name'])
X = df.values
transformed_X = pipeline.fit_transform(X)
print(transformed_X.toarray())

Which returns (the first 3 columns are the countries, the last 4 are the names):

[[ 1.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.]]

Answer 4

To make a long story short, if you are looking to dummify your df, use pd.get_dummies as:

# 'str' stands for the name of the categorical column, e.g. 'Country'
dummy=pd.get_dummies(df['str'])
df=pd.concat([df,dummy], axis=1)
print(df)
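
Applied to the data from the question, this could look like the sketch below. The `prefix` argument and the choice to drop the original Country column are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])

# One indicator column per distinct country, named Country_<value>
dummies = pd.get_dummies(df['Country'], prefix='Country')

# Replace the original Country column with its indicator columns
result = pd.concat([df.drop(columns='Country'), dummies], axis=1)

print(result)
```

Depending on the pandas version, the indicator columns are integers or booleans; `dummies.astype(int)` gives 0/1 values in either case.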

Answer 1: score 6, accepted, 2018-10-12 04:54:01
Answer 2: score 3, 2017-12-13 11:19:08
Answer 3: score 3, 2017-12-13 12:09:06
Answer 4: score 0, 2020-11-06 17:56:40
