
Unexpected issue when encoding data using LabelEncoder and OneHotEncoder from sklearn

I am encoding some data to pass into an ML model using the LabelEncoder and OneHotEncoder from sklearn. However, I am getting an error that relates to a column that I don't think should be encoded at all.

Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as py

Dataset = pd.read_csv('C:\\Users\\taylorr2\\Desktop\\SID Alerts.csv', sep = ',')
X = Dataset.iloc[:,:-1].values
Y = Dataset.iloc[:,18].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()

As far as I can see, I am only trying to encode the first column of data. However, the error I am getting is the following:

onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
Traceback (most recent call last):

  File "<ipython-input-132-360fc0133165>", line 2, in <module>
    X = onehotencoder.fit_transform(X).toarray()

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1902, in fit_transform
    self.categorical_features, copy=True)

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\preprocessing\data.py", line 1697, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "C:\Users\taylorr2\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: 'A string that only appears in column 16 or 18 of my data'

What is it about my code that makes it try to convert a value in column 16 or 18 into a float? And in any case, why is that a problem?

Thanks in advance for your advice!

I'm sorry, this is actually a comment, but due to my reputation I can't post comments yet :(

That string probably appears in column 17 of your data. I think this is because, for some reason, the last columns of the data are checked first. You can try passing fewer columns (e.g. 17, by passing X[:, 0:17]) to see what happens; I expect it will complain about the last column again.

Anyway, the input to OneHotEncoder should be a matrix of integers, as described here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html . But since you passed the indices of the categorical features to the OneHotEncoder class, I think that shouldn't matter (at least I'd expect the non-categorical features to be "ignored").

Reading the code in 'sklearn/preprocessing/data.py', I see that when it runs "X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)", it also considers the non-categorical features, even though the indices of the categorical features are passed as an argument to the function that calls check_array. I don't know, maybe this should be raised with the sklearn community on GitHub?

@Taylrl,

I encountered the same behavior and found it frustrating. As @Vivek pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any value in the provided X, in any column, cannot be successfully converted to a float.
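To make that concrete, here is a minimal sketch that reproduces the failure with NumPy alone. It assumes, based on the last frame of the traceback in the question, that the check boils down to np.array(array, dtype=...). The sample data and the helper name to_float_array are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

# Hypothetical stand-in for X: column 0 is already label-encoded,
# column 1 still holds free-text strings.
X = np.array([[0, "low"], [1, "high"], [0, "low"]], dtype=object)


def to_float_array(data):
    """Illustrative helper (not an sklearn function): convert to float
    the way the last traceback frame does, via np.array(..., dtype=...)."""
    return np.array(data, dtype=np.float64)


# Converting the whole array fails because of the string column,
# regardless of which columns were meant to be categorical.
try:
    to_float_array(X)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Converting only the numeric column succeeds.
col0 = to_float_array(X[:, 0])
print(col0)
```

The point of the sketch is that the conversion is applied to the entire array, so a single string anywhere in X is enough to trigger the ValueError.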

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
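One possible way around this, offered as a sketch and not as the officially recommended approach, is to hand OneHotEncoder only the column you want encoded and then stitch the result back onto the untouched columns. The sample data below is hypothetical; the real X would come from the CSV in the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is free text,
# column 2 is numeric.
X = np.array(
    [["red", "note a", 1.5],
     ["blue", "note b", 2.0],
     ["red", "note c", 0.5]],
    dtype=object,
)

# Label-encode column 0 on its own (classes are sorted: blue=0, red=1).
codes = LabelEncoder().fit_transform(X[:, 0])

# One-hot encode only that column, passed as a 2D integer array,
# so the string columns never reach the float conversion.
one_hot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# Put the encoded columns in front of the remaining, untouched columns.
X_encoded = np.hstack([one_hot, X[:, 1:]])

print(X_encoded)
```

The remaining string columns still need their own treatment (encoding or dropping) before the array can go into most models; this only avoids the error raised while encoding the first column.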
