Generating a one-hot encoding from dictionary values

Question

I am trying to build a one-hot array from the characters in my dictionary. First I create a NumPy array of zeros with shape rows x columns (3x7), then I look up the index of each character and set the corresponding entry in that character's row to 1.

My goal is to assign each character a one-hot array, with "1" meaning "present" and "0" meaning "not present". Here we have 3 characters, so there should be 3 rows, while the 7 columns correspond to the characters in the dictionary.

However, I get the error "TypeError: only integer scalar arrays can be converted to a scalar index". Can anyone please help me with this? Thank you.

To avoid any misunderstanding about my dictionary, here is how I create it:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

My code:

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

for x,y in a.items():
    aa = np.zeros((aa,aaa))
    aa[y] = 1

print(aa)

Current error:

TypeError: only integer scalar arrays can be converted to a scalar index

My expected output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]

Note: since this is a dictionary, the index order may differ; the "1"s in the array above are placeholders so that I can show my expected output.

Answer 1

Setting indices

(Comments inlined.)

# Sort and extract the indices.
idx = sorted(a.values())
# Initialise a matrix of zeros.
aa = np.zeros((len(idx), max(idx) + 1))
# Assign 1 to appropriate indices.
aa[np.arange(len(aa)), idx] = 1

print(aa)
[[0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]]
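
For context, the loop in the question fails because `aa` is rebound to the array on the first pass, so the second call to `np.zeros((aa, aaa))` receives an array where an integer dimension is expected; in addition, `aa[y] = 1` fills a whole row rather than a single cell. Note also that the matrix in this answer has `max(idx) + 1` = 5 columns, while the expected output in the question has 7, one per vocabulary entry. A minimal sketch, assuming the width should come from the question's `sent` list (the names `vocab_index` and `one_hot` are illustrative, not from the answer):

```python
import numpy as np

# Vocabulary and lookup table, built as in the question.
sent = ["a", "b", "c", "d", "e", "f", "g"]
vocab_index = {ch: i for i, ch in enumerate(sent)}  # illustrative name

sentences = ["b", "c", "e"]
idx = [vocab_index[ch] for ch in sentences]  # [1, 2, 4]

# One row per character, one column per vocabulary entry.
one_hot = np.zeros((len(idx), len(sent)))
# Pair row i with column idx[i] and set only that cell to 1.
one_hot[np.arange(len(idx)), idx] = 1

print(one_hot)
```

Sizing the array from `len(sent)` keeps the column count stable even when the sample does not contain the last characters of the vocabulary.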

`numpy.eye`

idx = sorted(a.values())
eye = np.eye(max(idx) + 1)    
aa = eye[idx]

print(aa)
[[0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]]
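
One thing to be aware of: `sorted(a.values())` reorders the rows by index, so the rows no longer follow the order of the input characters if that order was different. A hedged sketch of the same `np.eye` idea that keeps the input order and uses the full vocabulary width; the helper `one_hot_rows` and the name `lookup` are hypothetical and not part of the answer:

```python
import numpy as np

sent = ["a", "b", "c", "d", "e", "f", "g"]
lookup = {ch: i for i, ch in enumerate(sent)}  # illustrative name


def one_hot_rows(chars, lookup):
    """Return one row per character, in input order (hypothetical helper)."""
    idx = [lookup[ch] for ch in chars]
    # Row k of the identity matrix is the one-hot vector for index k.
    return np.eye(len(lookup))[idx]


print(one_hot_rows(["e", "b", "c"], lookup))
```

Here the first row encodes "e", the second "b", and the third "c", matching the order in which the characters were passed in.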

Answer 2

A one-hot encoding treats a sample as a sequence, where each element of the sequence is an index into a vocabulary, and the encoding indicates whether that element (such as a word or letter) is present in the sample. For example, if your vocabulary is the lower-case alphabet, an encoding of the word "cat" might look like:

 [1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]

This indicates that the word contains the letters c, a, and t.

To make a one-hot encoding you need two things. The first is a vocabulary lookup containing all the possible values (with words, this is why the matrices can get so large: the vocabulary is huge!). If you are encoding the lower-case alphabet, you only need 26 entries.

The second is your samples, which you typically represent as indexes into the vocabulary. So a set of words might look like this:

#bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])
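
The index arrays above do not have to be written by hand. A minimal sketch of how they could be derived from the words themselves, assuming the lower-case alphabet as the vocabulary (the names `char_to_index` and `words` are illustrative):

```python
import string

import numpy as np

# Lower-case alphabet as the vocabulary.
vocab = list(string.ascii_lowercase)
char_to_index = {ch: i for i, ch in enumerate(vocab)}  # illustrative name

words = ["bag", "cab", "fad"]
# This forms a rectangular array only because all words have the same length.
sentences = np.array([[char_to_index[ch] for ch in word] for word in words])

print(sentences)
```

For words of different lengths, a plain list of lists (or padding) would be needed instead of a single 2-D array.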

When you one-hot encode that, you get a 3 x 26 matrix:

import numpy as np

vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

# bag, cab, fad
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])

def onHot(sequences, dimension=len(vocab)):
    # One row per sample, one column per vocabulary entry.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set every column whose index occurs in this sample to 1.
        results[i, sequence] = 1
    return results

onHot(sentences)

This results in three one-hot encoded samples with a 26-letter vocabulary, ready to be fed to a neural network:

array([[1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
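
Strictly speaking, the matrix above marks which letters occur in each word (sometimes called a multi-hot or bag-of-characters encoding), so letter order and repeated letters are lost. If one vector per character is wanted instead, a minimal sketch could index an identity matrix with the same index array; the names `per_char` and `multi_hot` are illustrative:

```python
import numpy as np

# bag, cab, fad as vocabulary indices.
sentences = np.array([[1, 0, 6], [2, 0, 1], [5, 0, 3]])
vocab_size = 26

# Indexing the identity matrix with a 2-D integer array gives a 3-D result:
# (words, characters per word, vocabulary size).
per_char = np.eye(vocab_size)[sentences]
print(per_char.shape)  # (3, 3, 26)

# Collapsing the character axis recovers the "which letters occur" matrix.
multi_hot = per_char.max(axis=1)
print(multi_hot.shape)  # (3, 26)
```

Which of the two shapes is appropriate depends on whether the downstream model needs the order of the characters.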

Answer 3

My solution, for future readers:

I build the dictionary for the "sent" list:

sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}

Then I look up the indices of my own sentences in the dictionary and assign those numerical values to the sentences.

import numpy as np
sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"b":1, "c":2, "e":4}
aa =len(a)

I extract the indices from the new dictionary "a":

index = []
for x,y in a.items():
    index.append(y)

Then I convert the extracted indices into a NumPy array:

index = np.asarray(index)

Now I create a NumPy array of zeros to store the presence of each character:

new = np.zeros((aa,aaa))
new[np.arange(aa), index] = 1

print(new)

Output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]
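
As a quick sanity check, the matrix can be decoded back into characters by taking the column of the 1 in each row. This is only a sketch; the array is typed in by hand from the output above, and `decoded` is an illustrative name:

```python
import numpy as np

sent = ["a", "b", "c", "d", "e", "f", "g"]

# The one-hot matrix from the output above.
new = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
], dtype=float)

# argmax along each row gives the column index of the single 1.
decoded = [sent[i] for i in new.argmax(axis=1)]
print(decoded)  # ['b', 'c', 'e']
```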

Answer 4

Here is another one using sklearn.preprocessing.

The code is rather long and not much different. I don't know why, but it produces a similar result.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
sent = ["a", "b", "c", "d", "e", "f", "g"]
aaa = len(sent)
aa = {x:i for i,x in enumerate(sent)}


sentences = ["b", "c", "e"]
a = {}
for xx in sentences:
   a[xx] = aa[xx]
a = {"a":0, "b":1, "c":2, "d":3, "e":4, "f":5, "g":6}
aa =len(a)

index = []
for x,y in a.items():
    index.append([y])

index = np.asarray(index)

enc = OneHotEncoder()
enc.fit(index)

print(enc.transform([[1], [2], [4]]).toarray())

Output:

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]]
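
The integer detour is not strictly required. A shorter sketch, assuming scikit-learn 0.20 or later (where `OneHotEncoder` accepts a `categories` argument), encodes the characters directly and fixes the column order to the full vocabulary; the names `X` and `one_hot` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

sent = ["a", "b", "c", "d", "e", "f", "g"]
sentences = ["b", "c", "e"]

# Fix the columns to the full vocabulary instead of inferring them from the data.
enc = OneHotEncoder(categories=[sent])

# The encoder expects 2-D input: one row per sample, one column per feature.
X = np.array(sentences).reshape(-1, 1)

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = enc.fit_transform(X).toarray()
print(one_hot)
```

Passing the vocabulary explicitly means the result has 7 columns even though only three of the characters appear in the sample.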

Answer 5

I like to use a LabelEncoder with a OneHotEncoder from sklearn.

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
le = sklearn.preprocessing.LabelEncoder().fit(texty_data)
integery_data = le.transform(texty_data)
ohe = sklearn.preprocessing.OneHotEncoder().fit(integery_data.reshape((-1,1)))
onehot_data = ohe.transform(integery_data.reshape((-1,1)))

It stores the result as a sparse matrix, which is handy. You can also use a LabelBinarizer to streamline this:

import sklearn.preprocessing
import numpy as np

texty_data = np.array(["a", "c", "b"])
lb = sklearn.preprocessing.LabelBinarizer().fit(texty_data)
onehot_data = lb.transform(texty_data)
print(onehot_data, lb.inverse_transform(onehot_data))
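
If scikit-learn is not available, roughly the same label-then-encode idea can be sketched with NumPy alone. This is not what the answer uses; it swaps in `np.unique` with `return_inverse=True` for the label encoding step, and the names `classes` and `inverse` are illustrative:

```python
import numpy as np

texty_data = np.array(["a", "c", "b"])

# classes holds the sorted distinct labels; inverse maps each sample
# to the position of its label in classes.
classes, inverse = np.unique(texty_data, return_inverse=True)

# Pick the matching identity-matrix row for every sample.
onehot_data = np.eye(len(classes), dtype=int)[inverse]
print(onehot_data)

# Decode by looking up the column of the 1 in each row.
print(classes[onehot_data.argmax(axis=1)])
```

Unlike the sparse output of OneHotEncoder, this produces a dense array, which is fine for small vocabularies but can be wasteful for large ones.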

Answer summary (5 answers)

Answer 1: score 3, accepted, 2018-09-14 01:38:25
Answer 2: score 2, 2018-09-14 01:56:41
Answer 3: score 1, 2018-09-14 01:58:10
Answer 4: score 1, 2018-09-14 04:57:28
Answer 5: score 0, 2018-09-14 04:33:36
