Inconsistent labeling in sklearn LabelEncoder?
I have applied a LabelEncoder() on a dataframe, which returns the following:
The order/new_cart values have different label-encoded numbers, like 70, 64, 71, etc.
Is this inconsistent labeling, or did I do something wrong somewhere?
LabelEncoder works on one-dimensional arrays. If you apply it to multiple columns, the codes will be consistent within each column but not across columns.
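To make the per-column behaviour concrete, here is a minimal sketch. The frame `toy` and its values are made up for illustration and are not from the question; the point is only that the same label can receive different codes in different columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame; the values are invented for illustration.
toy = pd.DataFrame({"x": ["b", "c", "c"], "y": ["a", "b", "c"]})

# apply() calls fit_transform once per column, so each column
# ends up with its own label-to-integer mapping.
per_column = toy.apply(LabelEncoder().fit_transform)
print(per_column)
```

In this sketch the label 'b' is encoded as 0 in column x (where the only labels are 'b' and 'c') and as 1 in column y (where 'a' sorts first).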
As a workaround, you can convert the dataframe to a one-dimensional array and call LabelEncoder on that array.
Assume this is the dataframe:
df
Out[372]:
0 1 2
0 d d a
1 c a c
2 c c b
3 e e d
4 d d e
5 d b e
6 e e b
7 a e b
8 b c c
9 e a b
With ravel and then reshaping:
pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]:
0 1 2
0 3 3 0
1 2 0 2
2 2 2 1
3 4 4 3
4 3 3 4
5 3 1 4
6 4 4 1
7 0 4 1
8 1 2 2
9 4 0 1
Edit:
If you want to store the labels, you need to save the LabelEncoder object.
le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Now, le.classes_ gives you the classes. The position of each class in this array is its integer code (starting from 0).
le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)
If you want to access the integer by label, you can construct a dict:
dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
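You can do the same with the transform method, without building a dict. Recent scikit-learn versions expect a one-dimensional array-like here rather than a bare string, so wrap the label in a list:
le.transform(['c'])
Out[395]: array([2])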
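Keeping the fitted encoder also lets you go the other way, from codes back to labels, with inverse_transform. The following is a minimal sketch under stated assumptions: the frame `small` is a hypothetical stand-in for df, not the data from the answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame standing in for df; any frame of string labels would do.
small = pd.DataFrame({"p": ["d", "c", "a"], "q": ["a", "b", "d"]})

# Fit once on the flattened values, then restore the original shape.
le = LabelEncoder()
codes = le.fit_transform(small.values.ravel()).reshape(small.shape)
encoded = pd.DataFrame(codes, columns=small.columns)

# inverse_transform maps the integer codes back to the original labels.
decoded = pd.DataFrame(
    le.inverse_transform(encoded.values.ravel()).reshape(encoded.shape),
    columns=encoded.columns,
)
print(decoded)
```

This only works with the same fitted object, which is why the encoder itself has to be kept around rather than just the encoded frame.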
Because of the way the apply and fit_transform functions work, you are accidentally calling the fit function on each column of your frame. Let's walk through what's happening in the following line:
labeled_df = String_df.apply(LabelEncoder().fit_transform)
This creates a new LabelEncoder object and calls apply, passing in the fit_transform method. For each column in your DataFrame, apply calls fit_transform on your encoder, passing in the column as an argument. This does two things: it refits the encoder on that column, replacing whatever it learned before, and it returns the codes for that column under the new fit. The codes will not be consistent across columns because each time you call fit_transform the LabelEncoder object can choose new transformation codes.
To keep the codes consistent across columns, fit the LabelEncoder on the whole dataset first. Then pass the transform function to your apply function, instead of the fit_transform function. You can try the following:
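from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
all_values = String_df.values.ravel()  # flatten the dataframe into one long array
encoder.fit(all_values)  # fit once, on every value in the frame
labeled_df = String_df.apply(encoder.transform)  # transform only, no refitting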
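For a self-contained check of this approach, here is a small sketch. The frame `frame`, its column names, and its values are hypothetical stand-ins for String_df, which is not shown in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for String_df; column names and values are invented.
frame = pd.DataFrame({"first": ["b", "c", "c"], "second": ["a", "b", "c"]})

encoder = LabelEncoder()
encoder.fit(frame.values.ravel())        # one fit over every value in the frame
shared = frame.apply(encoder.transform)  # transform only, so the mapping never changes
print(shared)
```

Because the encoder is fitted once on all values, 'b' maps to 1 and 'c' maps to 2 in both columns.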