Unexpected issue when encoding data with LabelEncoder and OneHotEncoder from sklearn

Question

I am encoding some data to pass into an ML model using the LabelEncoder and OneHotEncoder from sklearn. However, I am getting an error that relates to a column that I don't think should be encoded.

Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as py

Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()

As far as I can see, I am only trying to encode the first column of the data. However, the error I am getting is the following:

onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):

  File "<ipython-input-132-360fc0133165>", line 2, in <module>
    X = onehotencoder.fit_transform(X).toarray()

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'

What is it about my code that makes it try to convert a value in column 16 or 18 into a float, and in any case, what is the problem with doing that?

Thanks in advance for your advice!

Answer 1

I'm sorry, this is actually a comment, but due to my reputation I can't post comments yet :(

That string probably appears in column 17 of your data, and I think that's because, for some reason, the last columns of the data are checked first. You can try passing fewer columns (e.g. 17, by passing X[:, 0:17]) to see what happens; it will complain about the last column again.

Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html. But since you passed the indices of the categorical features to the OneHotEncoder class, I would have thought that shouldn't matter (at least I'd expect the non-categorical features to be "ignored").

Reading the code in 'sklearn/preprocessing/data.py', I've seen that when they call "X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)", they are validating the non-categorical features too, even though the indices of the categorical features are passed as an argument to the function that calls check_array. I don't know, maybe this should be raised with the sklearn community on GitHub?
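One way to satisfy the "matrix of numbers" requirement mentioned above is to give every string column a numeric code before anything reaches OneHotEncoder. The following is only a minimal sketch: the column names and values are made up for illustration (they are not from the asker's CSV), and it uses pandas.factorize instead of LabelEncoder, which is a swapped-in technique rather than what the original code does.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: two string columns, one numeric.
df = pd.DataFrame({
    "alert_type": ["a", "b", "a"],
    "score": [1.0, 2.0, 3.0],
    "comment": ["x", "y", "x"],
})

encoded = df.copy()
for column in encoded.columns:
    # Replace every non-numeric column with integer codes (0, 1, 2, ...).
    if not pd.api.types.is_numeric_dtype(encoded[column]):
        codes, _ = pd.factorize(encoded[column])
        encoded[column] = codes

# The whole matrix can now be converted to float without errors.
X = encoded.to_numpy(dtype=float)
```

Note that integer codes imply an arbitrary ordering, so columns that are truly categorical would usually still need one-hot encoding afterwards.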

Answer 2

@Taylrl,

I encountered the same behavior and found it frustrating. As @Vivek pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any of the data in the provided input X cannot be successfully converted to a float.
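The failure can be reproduced without sklearn at all, since the traceback in the question ends in a plain np.array(array, dtype=...) call. The sketch below uses a small made-up array (not the asker's data) and a hypothetical helper, to_float_or_error, that only mimics the whole-array float conversion; it is not sklearn's actual implementation.

```python
import numpy as np

# Hypothetical stand-in for X: column 0 is already label-encoded,
# but the last column still holds raw strings.
X = np.array([[0, 1.5, "red"],
              [1, 2.5, "blue"]], dtype=object)

def to_float_or_error(array):
    """Try to convert the whole array to float; return the error if it fails."""
    try:
        return np.array(array, dtype=np.float64)
    except ValueError as exc:
        return exc

# Fails, even though only column 0 was meant to be encoded.
result_full = to_float_or_error(X)

# Succeeds once the string column is left out.
result_numeric = to_float_or_error(X[:, :2])
```

Here result_full is a ValueError of the same kind as in the question, while result_numeric is an ordinary float array.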

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
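For readers on a newer scikit-learn (0.20 or later, which is an assumption about your environment and not the version shown in the traceback), ColumnTransformer can apply OneHotEncoder to selected columns only and pass the rest through untouched. This is a sketch with made-up data, not the asker's dataset:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 2 is free text.
X = np.array([["a", 1.0, "note one"],
              ["b", 2.0, "note two"],
              ["a", 3.0, "note three"]], dtype=object)

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",   # keep the other columns as they are
    sparse_threshold=0,        # always return a dense array
)
X_encoded = ct.fit_transform(X)
```

The result has two indicator columns for "a" and "b" followed by the two untouched columns. The free-text column is still a string, so it would need to be encoded or dropped before fitting most models.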
