Inconsistent labeling in sklearn LabelEncoder?

I applied LabelEncoder() to a dataframe, and it returned the following (the screenshot from the original post is not available):

The order/new_cart values have different label-encoded numbers, such as 70, 64, 71, etc.

Is this inconsistent labeling, or did I do something wrong somewhere?

LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, the codes will be consistent within each column but not across columns.
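A minimal sketch of that behavior, assuming a small made-up frame (`df_demo` and its column names are illustrative and not from the question). The label "b" appears in both columns but receives a different code in each, because every column is fitted on its own:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame; both columns contain the label "b",
# but they do not contain the same set of labels.
df_demo = pd.DataFrame({"x": ["a", "b", "c"], "y": ["b", "c", "d"]})

# A fresh encoder is fitted per column, so codes are only meaningful within a column.
per_column = df_demo.apply(lambda col: LabelEncoder().fit_transform(col))

# Look up the code that "b" received in each column.
code_b_in_x = per_column.loc[df_demo["x"] == "b", "x"].iloc[0]
code_b_in_y = per_column.loc[df_demo["y"] == "b", "y"].iloc[0]
print(code_b_in_x, code_b_in_y)  # the same label gets two different codes
```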

As a workaround, you can flatten the dataframe into a one-dimensional array and call LabelEncoder on that array.

Assume this is the dataframe:

df
Out[372]: 
   0  1  2
0  d  d  a
1  c  a  c
2  c  c  b
3  e  e  d
4  d  d  e
5  d  b  e
6  e  e  b
7  a  e  b
8  b  c  c
9  e  a  b

Flatten with ravel, encode, then reshape back to the original shape:

pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]: 
   0  1  2
0  3  3  0
1  2  0  2
2  2  2  1
3  4  4  3
4  3  3  4
5  3  1  4
6  4  4  1
7  0  4  1
8  1  2  2
9  4  0  1

Edit:

If you want to keep the label mapping, you need to keep the fitted LabelEncoder object.

le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)

Now le.classes_ holds the classes. Each class's position in the array (starting from 0) is its integer code.

le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)

If you want to look up the integer for a label, you can construct a dict:

dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

You can get the same result with the transform method, without building a dict. transform expects a one-dimensional array-like, so wrap a single label in a list:

le.transform(['c'])
Out[395]: array([2])
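Going the other way, from integer back to label, can be sketched with inverse_transform. This is an illustrative example only: `le_demo` is a hypothetical encoder fitted on the same five letters as above, not the `le` object from the session.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder fitted on the five labels used in the example frame.
le_demo = LabelEncoder().fit(np.array(["d", "c", "e", "a", "b"], dtype=object))

codes = le_demo.transform(["c", "e"])      # labels -> integer codes
labels = le_demo.inverse_transform(codes)  # integer codes -> labels
print(codes.tolist(), labels.tolist())
```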

Your LabelEncoder object is being refit on each column of your DataFrame.

Because of the way the apply and fit_transform functions work, you are accidentally calling fit on each column of your frame. Let's walk through what's happening in the following line:

labeled_df = String_df.apply(LabelEncoder().fit_transform)
  1. Create a new LabelEncoder object.
  2. Call apply, passing in the fit_transform method. For each column in your DataFrame, apply calls fit_transform on your encoder with the column as its argument. This does two things:
    A. Refits your encoder (modifying its state).
    B. Returns the codes for the elements of the column based on the encoder's new fit.

The codes will not be consistent across columns, because each call to fit_transform discards the previous fit and assigns codes from the sorted unique values of that column alone.
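One way to see the refitting is to inspect the encoder after apply has run. In this sketch, `frame` and `shared` are hypothetical names standing in for String_df and the encoder; after the call, the encoder only knows the classes of the last column it was fitted on:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame standing in for String_df.
frame = pd.DataFrame({"first": ["a", "b", "c"], "second": ["x", "y", "x"]})

shared = LabelEncoder()
codes = frame.apply(shared.fit_transform)  # refits the same encoder on every column

# The labels from the "first" column have been forgotten.
print(list(shared.classes_))
```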

If you want your codes to be consistent across columns, you should fit your LabelEncoder on your whole dataset.

Then pass the transform function to apply, instead of the fit_transform function. You can try the following:

encoder = LabelEncoder()
all_values = String_df.values.ravel() #convert the dataframe to one long array
encoder.fit(all_values)
labeled_df = String_df.apply(encoder.transform)
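As a quick check of this approach, here is a self-contained sketch on made-up data (`string_df` and its contents are assumptions for illustration). It fits once on all values, transforms each column, and confirms that every cell agrees with a single label-to-code lookup:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for String_df.
string_df = pd.DataFrame({"first": ["a", "b", "c"], "second": ["c", "a", "a"]})

encoder = LabelEncoder()
encoder.fit(string_df.values.ravel())            # fit once on every value in the frame
labeled_df = string_df.apply(encoder.transform)  # transform only, no refitting

# Build one label -> code lookup and compare it against every encoded cell.
lookup = dict(zip(encoder.classes_, range(len(encoder.classes_))))
expected = string_df.apply(lambda col: col.map(lookup))
consistent = bool((expected == labeled_df).all().all())
print(lookup, consistent)
```

Because the encoder is fitted only once, the same label maps to the same integer wherever it appears in the frame.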
