Working of labelEncoder in sklearn

Question

Say I have the following input feature:

hotel_id = [1, 2, 3, 2, 3]

This is a categorical feature with numeric values. If I give it to the model as it is, the model will treat it as a continuous variable, i.e., 2 > 1.

If I apply sklearn's LabelEncoder, then I will get:

hotel_id = [0, 1, 2, 1, 2]

So is this encoded feature treated as continuous or categorical? If it is treated as continuous, then what is the use of LabelEncoder?

P.S. I know about one-hot encoding, but there are around 100 hotel_ids, so I don't want to use it. Thanks.

Answer 1

The LabelEncoder is a way to encode class levels. In addition to the integer example you've included, consider the following example:

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>>
>>> train = ["paris", "paris", "tokyo", "amsterdam"]
>>> test = ["tokyo", "tokyo", "paris"]
>>> le.fit(train).transform(test)
array([2, 2, 1]...)

What the LabelEncoder allows us to do, then, is to assign ordinal levels to categorical data. However, what you've noted is correct: namely, the [2, 2, 1] is treated as numeric data. This is a good candidate for using the OneHotEncoder for dummy variables (which I know you said you were hoping not to use).
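To make the mapping behind those integers visible, here is a minimal sketch that inspects the fitted encoder's classes_ attribute. It reuses the train and test lists from the example above; nothing else is assumed beyond scikit-learn being installed.

```python
from sklearn.preprocessing import LabelEncoder

train = ["paris", "paris", "tokyo", "amsterdam"]
test = ["tokyo", "tokyo", "paris"]

le = LabelEncoder().fit(train)

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)  # {'amsterdam': 0, 'paris': 1, 'tokyo': 2}

codes = le.transform(test)
print(codes.tolist())  # [2, 2, 1]
```

The codes follow the sorted order of the labels, so "tokyo" gets 2 only because it sorts last, not because it is in any sense larger than "paris".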

Note that in older scikit-learn releases (before 0.20) the LabelEncoder had to be used prior to one-hot encoding, as the OneHotEncoder could not handle string categories. It was therefore frequently used as a precursor to one-hot encoding. Newer releases of OneHotEncoder accept string categories directly.
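As a rough sketch of what one-hot encoding would look like for the hotel_id feature from the question, assuming a reasonably recent scikit-learn release: the encoder expects a 2-D input, so the IDs are reshaped into a single column first. By default the result is a SciPy sparse matrix, which stores only the non-zero entries, so around 100 distinct IDs add 100 columns but only one stored value per row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column, five rows: OneHotEncoder works on 2-D feature arrays.
hotel_id = np.array([1, 2, 3, 2, 3]).reshape(-1, 1)

enc = OneHotEncoder()                 # sparse output by default
onehot = enc.fit_transform(hotel_id)

print(enc.categories_[0].tolist())    # [1, 2, 3]
print(onehot.shape)                   # (5, 3)
print(onehot.toarray().tolist())      # dense view, for inspection only
```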

Alternatively, it can encode your target into a usable array. If, for instance, train were your target for classification, you would need a LabelEncoder to use it as your y variable.
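A minimal sketch of that target-encoding use, treating the train list from above as the class labels. The predicted array is made up purely for illustration and stands in for whatever a fitted classifier would return.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train = ["paris", "paris", "tokyo", "amsterdam"]

le = LabelEncoder()
y = le.fit_transform(train)   # integer codes usable as the y variable
print(y.tolist())             # [1, 1, 2, 0]

# Hypothetical model output, used only to show the reverse mapping.
predicted = np.array([2, 0])
print(le.inverse_transform(predicted).tolist())  # ['tokyo', 'amsterdam']
```

Keeping the fitted encoder around lets you turn predicted codes back into the original labels with inverse_transform.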

Answer 2

If you are running a classification model, then the labels are treated as classes and the order is ignored. You don't need to one-hot encode them.
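Here is a small sketch of that claim, reading "labels" as the target y rather than an input feature. The toy data and the choice of DecisionTreeClassifier are assumptions made for illustration. The same data is fitted twice, once with the class codes shuffled, and the predictions change only by the same renaming.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: three well-separated groups along a single feature.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# Same classes with the integer codes shuffled: 0 -> 2, 1 -> 0, 2 -> 1.
remap = np.array([2, 0, 1])
y_shuffled = remap[y]

clf_a = DecisionTreeClassifier(random_state=0).fit(X, y)
clf_b = DecisionTreeClassifier(random_state=0).fit(X, y_shuffled)

X_new = np.array([[0.05], [1.05], [2.05]])
pred_a = clf_a.predict(X_new)
pred_b = clf_b.predict(X_new)

print(pred_a.tolist())  # [0, 1, 2]
print(pred_b.tolist())  # [2, 0, 1]

# Renaming the classes renames the predictions and nothing else.
print(bool((remap[pred_a] == pred_b).all()))  # True
```

This only covers integers used as class labels. Integers passed in as an input feature, such as hotel_id in X, are still read as numeric values by the model.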

Answer 3

One way to handle this problem is to convert your numbers to labels with the inflect package.

I went through all the hotel IDs and changed them into words, for example 1 -> 'one', 2 -> 'two', ..., 99 -> 'ninety-nine'.

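import inflect

p = inflect.engine()

def toNominal(df, column):
    # Replace each numeric ID with its English word form, e.g. 1 -> 'one',
    # so the column holds strings (nominal data) instead of numbers.
    # int(n) converts pandas/NumPy integers to plain Python ints first.
    df[column] = df[column].map(lambda n: p.number_to_words(int(n)))

# df is assumed to be an existing pandas DataFrame with a 'hotel_id' column.
toNominal(df, 'hotel_id')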
