Generate unique IDs for a list of lists containing words

Question

I have a list of lists containing pairs of words and want to map the words to ids. The ids should run from 0 up to (but not including) len(set(words)). The list currently looks like this:

[['pluripotent', 'Scharte'],
 ['Halswirbel', 'präventiv'],
 ['Kleiber', 'Blauspecht'],
 ['Kleiber', 'Scheidung'],
 ['Nillenlutscher', 'Salzstangenlecker']]

The result should have the same format, but with ids instead of words. For example:

[[0, 1],
 [2, 3],
 [4, 5],
 [4, 6],
 [7, 8]]

This is what I have so far, but it doesn't give me the right output:

def words_to_ids(labels):
  vocabulary = []
  word_to_id = {}
  ids = []
  for word1,word2 in labels:
      vocabulary.append(word1)
      vocabulary.append(word2)

  for i, word in enumerate(vocabulary):
      word_to_id [word] = i
  for word1,word2 in labels:
      ids.append([word_to_id [word1], word_to_id [word1]])
  print(ids)

Output:

[[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]

It repeats ids for words that are unique.

Answer 1

You have two errors. First, you have a simple typo here:

for word1,word2 in labels:
    ids.append([word_to_id [word1], word_to_id [word1]])

You are adding the id for word1 twice there. Correct the second word1 to look up word2 instead.

Next, you are not testing whether you have seen a word before, so for 'Kleiber' you first give it the id 4, then overwrite that entry with 6 on a later iteration. You need to number the unique words only, not every word:

counter = 0
for word in vocabulary:
    if word not in word_to_id:
        word_to_id[word] = counter
        counter += 1

Or you could simply not add a word to vocabulary if it is already listed. You don't really need a separate vocabulary list here, by the way. A separate loop doesn't buy you anything, so the following works too:

word_to_id = {}
counter = 0
for words in labels:
    for word in words:
        # only assign an id the first time a word is seen
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1
You can simplify your code quite a bit by using a defaultdict object and itertools.count() to supply default values:

from collections import defaultdict
from itertools import count

def words_to_ids(labels):
    word_ids = defaultdict(count().__next__)
    return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]

The count() object gives you the next integer value in a series each time __next__ is called, and defaultdict() will call that each time you try to access a key that doesn't yet exist in the dictionary. Together, they ensure a unique ID for each unique word.
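
To see that interaction in isolation, here is a small sketch (the variable names first, second and again are only illustrative) showing that the factory runs for missing keys and is skipped for keys already present:

```python
from collections import defaultdict
from itertools import count

# Each missing key triggers count().__next__, which yields 0, 1, 2, ...
word_ids = defaultdict(count().__next__)

first = word_ids['Kleiber']      # missing key: factory is called, stores 0
second = word_ids['Blauspecht']  # missing key: factory is called, stores 1
again = word_ids['Kleiber']      # existing key: factory is not called

print(first, second, again)  # 0 1 0
print(dict(word_ids))        # {'Kleiber': 0, 'Blauspecht': 1}
```

One thing to keep in mind is that merely looking up an unknown word adds it to the mapping, so you may want to convert the result with dict(word_ids) once all ids have been assigned.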

Answer 2

There are two issues:

You made a typo by repeating the lookup of word1 in word_to_id.
When constructing your word_to_id dictionary you need to consider unique values only.

For example, in Python 3.7+ you can take advantage of insertion-ordered dictionaries:

for i, word in enumerate(dict.fromkeys(vocabulary)):
    word_to_id[word] = i

for word1, word2 in labels:
    ids.append([word_to_id[word1], word_to_id[word2]])
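
As a self-contained sketch, the two loops above could be wrapped into a function like this. The list comprehension that builds vocabulary is an assumption about how the flattening would be written, and the sample input is the first four pairs from the question:

```python
def words_to_ids(labels):
    # Flatten the pairs into one list of words, duplicates included.
    vocabulary = [word for pair in labels for word in pair]
    # dict.fromkeys drops duplicates and keeps first-seen order (Python 3.7+).
    word_to_id = {word: i for i, word in enumerate(dict.fromkeys(vocabulary))}
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [['pluripotent', 'Scharte'],
          ['Halswirbel', 'präventiv'],
          ['Kleiber', 'Blauspecht'],
          ['Kleiber', 'Scheidung']]

print(words_to_ids(labels))
# [[0, 1], [2, 3], [4, 5], [4, 6]]
```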

An alternative for versions before 3.7 is to use collections.OrderedDict or the unique_everseen recipe from the itertools documentation.
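
A rough sketch of the OrderedDict variant, assuming the same pair-shaped input as in the question (the function name words_to_ids_ordered is made up for this illustration):

```python
from collections import OrderedDict


def words_to_ids_ordered(labels):
    vocabulary = [word for pair in labels for word in pair]
    # OrderedDict.fromkeys keeps first-seen order on any Python version.
    unique_words = OrderedDict.fromkeys(vocabulary)
    word_to_id = {word: i for i, word in enumerate(unique_words)}
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


pairs = [['Kleiber', 'Blauspecht'], ['Kleiber', 'Scheidung']]
print(words_to_ids_ordered(pairs))
# [[0, 1], [0, 2]]
```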

If there is no ordering requirement, you can just use set(vocabulary).
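
A sketch of the set-based variant follows (the function name words_to_ids_unordered is hypothetical). The ids are still unique and cover 0 to len(set(words)) - 1, but which word receives which id is arbitrary and can differ between runs, so no expected output is shown:

```python
def words_to_ids_unordered(labels):
    # A set removes duplicates but does not preserve insertion order.
    vocabulary = {word for pair in labels for word in pair}
    word_to_id = {word: i for i, word in enumerate(vocabulary)}
    return [[word_to_id[word1], word_to_id[word2]] for word1, word2 in labels]


labels = [['pluripotent', 'Scharte'],
          ['Halswirbel', 'präventiv'],
          ['Kleiber', 'Blauspecht'],
          ['Kleiber', 'Scheidung']]

ids = words_to_ids_unordered(labels)
# 'Kleiber' appears in the last two pairs and gets the same id in both.
print(ids[2][0] == ids[3][0])
```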

2 answers

Answer 1
2, accepted, 2019-01-16 16:36:07

Answer 2
1, 2019-01-16 16:44:28
