Produce unique ids for a list of lists with words
I have a list of lists containing pairs of words and want to map the words to ids. Ids should run from 0 up to, but not including, len(set(words)). The list currently looks like this:
    [['pluripotent', 'Scharte'],
     ['Halswirbel', 'präventiv'],
     ['Kleiber', 'Blauspecht'],
     ['Kleiber', 'Scheidung'],
     ['Nillenlutscher', 'Salzstangenlecker']]
The result should have the same format, but with ids instead of words. For example:
    [[0, 1],
     [2, 3],
     [4, 5],
     [4, 6],
     [7, 8]]
This is what I have so far, but it doesn't give me the right output:
    def words_to_ids(labels):
        vocabulary = []
        word_to_id = {}
        ids = []
        for word1,word2 in labels:
            vocabulary.append(word1)
            vocabulary.append(word2)

        for i, word in enumerate(vocabulary):
            word_to_id [word] = i
        for word1,word2 in labels:
            ids.append([word_to_id [word1], word_to_id [word1]])

        print(ids)
Output:
[[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]
It repeats ids for words that are actually distinct.
You have two errors. First, you have a simple typo here:
    for word1,word2 in labels:
        ids.append([word_to_id [word1], word_to_id [word1]])
You are adding the id for word1 twice there. Correct the second word1 to look up word2 instead.
Second, you never check whether you have seen a word before. For 'Kleiber', you first assign the id 4, then overwrite that entry with 6 on a later iteration. You need to number unique words only, not every occurrence:
    counter = 0
    for word in vocabulary:
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1
Alternatively, you could skip adding a word to vocabulary if it is already listed there. You don't really need a separate vocabulary list here, by the way. A separate loop doesn't buy you anything, so the following works too:
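    word_to_id = {}
    counter = 0
    for words in labels:
        for word in words:
            # Skip words that already have an id, so earlier ids are not overwritten.
            if word not in word_to_id:
                word_to_id[word] = counter
                counter += 1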
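Putting both fixes together, a complete version of the function from the question might look like the sketch below. Returning the ids instead of printing them is an assumption made here for easier reuse; the sample data is taken from the question.

```python
def words_to_ids(labels):
    """Map each distinct word to an id, in order of first appearance."""
    word_to_id = {}
    counter = 0
    for words in labels:
        for word in words:
            # Only assign a new id the first time a word is seen.
            if word not in word_to_id:
                word_to_id[word] = counter
                counter += 1
    # Look up word1 and word2 (not word1 twice).
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [
    ['pluripotent', 'Scharte'],
    ['Halswirbel', 'präventiv'],
    ['Kleiber', 'Blauspecht'],
    ['Kleiber', 'Scheidung'],
    ['Nillenlutscher', 'Salzstangenlecker'],
]
print(words_to_ids(labels))
# [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```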
You can simplify your code quite a bit by using a defaultdict object and itertools.count() to supply default values:
    from collections import defaultdict
    from itertools import count

    def words_to_ids(labels):
        word_ids = defaultdict(count().__next__)
        return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]
Each call to __next__ on the count() object returns the next integer in the series, and defaultdict() calls it each time you try to access a key that doesn't yet exist in the dictionary. Together, they ensure a unique ID for each unique word.
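To make that mechanism concrete, here is a small sketch of how the dictionary hands out ids on first access and reuses them afterwards. The variable names first, second and again are only illustrative.

```python
from collections import defaultdict
from itertools import count

# Each missing key triggers count().__next__, which yields 0, 1, 2, ...
word_ids = defaultdict(count().__next__)

first = word_ids['Kleiber']      # new key: gets the next id
second = word_ids['Blauspecht']  # new key: gets the following id
again = word_ids['Kleiber']      # existing key: the factory is not called

print(first, second, again)  # 0 1 0
print(dict(word_ids))        # {'Kleiber': 0, 'Blauspecht': 1}
```

Note that a plain subscript lookup such as word_ids[word] inserts the key as a side effect. To check for a word without assigning it an id, use the in operator or .get() instead.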
There are two issues:
1. You made a typo by repeating the lookup of word1 in word_to_id.
2. When constructing your word_to_id dictionary you need to consider unique values only.
For example, in Python 3.7+ you can take advantage of insertion-ordered dictionaries:
    for i, word in enumerate(dict.fromkeys(vocabulary)):
        word_to_id[word] = i
    for word1, word2 in labels:
        ids.append([word_to_id[word1], word_to_id[word2]])
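The fragment above relies on vocabulary, word_to_id and ids already existing. A self-contained sketch of the same idea, with the list flattening and the return value added here as assumptions, could look like this:

```python
def words_to_ids(labels):
    # Flatten the pairs into one list of words, duplicates included.
    vocabulary = [word for pair in labels for word in pair]
    # dict.fromkeys keeps the first occurrence of each word in insertion
    # order (guaranteed from Python 3.7), so enumerate gives dense ids.
    word_to_id = {word: i for i, word in enumerate(dict.fromkeys(vocabulary))}
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung']]
print(words_to_ids(labels))  # [[0, 1], [0, 2]]
```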
An alternative for versions before 3.7 is to use collections.OrderedDict or the unique_everseen recipe from the itertools documentation.
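As a rough illustration of the OrderedDict route, the sketch below deduplicates the words while keeping first-appearance order. The sample labels and the unique_words name are assumptions for the example.

```python
from collections import OrderedDict

labels = [['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung']]
vocabulary = [word for pair in labels for word in pair]

# OrderedDict.fromkeys drops duplicates while keeping first-appearance
# order, even on Python versions where plain dicts are unordered.
unique_words = list(OrderedDict.fromkeys(vocabulary))
word_to_id = {word: i for i, word in enumerate(unique_words)}

print(unique_words)             # ['Kleiber', 'Blauspecht', 'Scheidung']
print(word_to_id['Scheidung'])  # 2
```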
If there is no ordering requirement, you can just use set(vocabulary).
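A brief sketch of the set-based variant follows. The ids are still unique and dense, but which word receives which id depends on set iteration order and may differ between runs, so no specific mapping is shown.

```python
labels = [['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung']]

vocabulary = [word for pair in labels for word in pair]
# set() removes duplicates but does not preserve first-appearance order,
# so which word gets which id can differ between runs.
word_to_id = {word: i for i, word in enumerate(set(vocabulary))}
ids = [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]

# The ids are still unique per word and cover 0 .. len(set(vocabulary)) - 1.
print(sorted(word_to_id.values()))  # [0, 1, 2]
```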