Encoding 32-bit hex numbers using one-hot encoding in sklearn

I have some categorical features hashed into 32-bit hex numbers. For example, in one category the three different classes are hashed into:

'05db9164'  '68fd1e64' '8cf07265'

One-hot encoding maps these into a binary array in which only one bit is 1 and the others are 0. So to encode the above features, only three bits are needed:

001 corresponds to 05db9164, 010 corresponds to 68fd1e64, and 100 corresponds to 8cf07265.

But when I use OneHotEncoder in sklearn, it tells me that the number is too large. This confuses me, because we don't care about the numerical value of the number; we only care whether two values are the same or not.

On the other hand, if I encode 0, 1, 2:

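from sklearn.preprocessing import OneHotEncoder

# fit on three integer class labels, one sample per row
enc = OneHotEncoder()
enc.fit([[0], [1], [2]])

print(enc.transform([[0]]).toarray())
print(enc.transform([[1]]).toarray())
print(enc.transform([[2]]).toarray())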

I get the expected answer. I think these 32-bit hex numbers are only used to indicate the class within the category, so they play the same role as 0, 1, 2, and [0,0,1], [0,1,0], [1,0,0] should be enough to encode them. Could you please help me? Thanks very much.

If your array is not extremely long, you can rename the features using np.unique. That way you can also determine the number of distinct feature values, which in turn you can feed to the OneHotEncoder so that it knows how many columns to allocate. Note that the renaming is not strictly necessary, but it has the nice side effect of generating integers, which use less space (if you use np.int32).

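import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(42)
# generate some data: 100 samples drawn from the three hashed values
data = np.array(['05db9164', '68fd1e64', '8cf07265'])[rng.randint(0, 3, 100)]

# new_labels holds, for each sample, the index of its value in uniques
uniques, new_labels = np.unique(data, return_inverse=True)
n_values = len(uniques)

# `n_values` is no longer accepted by newer scikit-learn releases;
# passing the integer codes 0..n_values-1 as `categories` allocates the same columns
encoder = OneHotEncoder(categories=[np.arange(n_values)])
encoded = encoder.fit_transform(new_labels[:, np.newaxis])

print(repr(encoded))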
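
With newer scikit-learn releases the renaming step can probably be skipped altogether. The following is a minimal sketch, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly. The `hashes` array is made-up sample data built from the three values in the question, not data from the original post.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical sample data: the three hashed values from the question, one repeated
hashes = np.array(['05db9164', '68fd1e64', '8cf07265', '05db9164'])

# the encoder expects a 2-D array of shape (n_samples, n_features)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(hashes.reshape(-1, 1)).toarray()

# categories_ lists the distinct values in sorted order, one array per feature;
# column i of `encoded` corresponds to categories_[0][i]
print(encoder.categories_[0])
print(encoded)
```

Each row contains a single 1, and the strings are only compared for equality, never interpreted as numbers. The column order follows the sorted category values, so it may differ from the 001/010/100 assignment written in the question, which does not matter for one-hot encoding.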
