How do I add words to an existing vocabulary using TensorFlow's contrib.learn?

Question

I am working with the TensorFlow vocabulary processor, imported like this:

from tensorflow.contrib import learn
vocabulary = learn.preprocessing.VocabularyProcessor(length)

I wrote a unit test to make sure that I could save the vocabulary, reload it, and fit a new sentence while still keeping the IDs assigned to the old one.

These were my results:

The fit sentence:  [1 2 3 4 5 6 2 7 8 4 5 9 7]
The new fit sentence:  [0 0 0 2 9 0 6 2 7 8 4 0 0]

This part worked correctly: the word at position 1 in the first sentence (encoded as 2) has the same value as the word at position 3 in the second sentence, because both are the word "is".

However, I noticed that all the new words were mapped to 0.

I would have expected my new fit sentence to look like this:

[10 11 12 2 9 10 6 2 7 8 4 12 11]

How can I fix this issue? How can I make my vocabulary processor learn new words?

Thank you!

EDIT 1:

This is a stripped-down version of my unit test:

import numpy as np
from tensorflow.contrib import learn

# A test sentence
test_sentence = "This is a test sentence. It is used to test. sentence, this, used"
test_sentence_len = len(test_sentence.split(" "))

# A vocabulary processor
vocabulary_processor = learn.preprocessing.VocabularyProcessor(test_sentence_len)

# Turning a list of sentences ( [test_sentence] ) into a list of fit test sentences and taking the first one.
fit_test_sentence = np.array(list(vocabulary_processor.fit_transform([test_sentence])))[0]

# We see that "is" ( position 1 ) and "is" ( position 6 ) are the same. They should have the same numeric value
# in the fit array as well
print("The fit sentence: ", fit_test_sentence)
# self.assertEqual(fit_test_sentence[1], fit_test_sentence[6])

initial_fit_sentence = fit_test_sentence

# Now, let's save

vocabulary_processor.save("some/path")

# Now, we load into a different variable

new_vocabulary_processor = learn.preprocessing.VocabularyProcessor.restore("some/path")

new_test_sentence = "Very different uttering is this one. It is used to test."

# Now, we fit the new sentence with the new vocabulary, which should be the old one
# We should see "is" being transformed into the same numerical value, initial_fit_sentence[1]

new_fit_sentence = np.array(list(new_vocabulary_processor.fit_transform([new_test_sentence])))[0]

print("The new fit sentence: ", new_fit_sentence)
# self.assertEqual(initial_fit_sentence[1], new_fit_sentence[3])

I tried changing the value of test_sentence_len, thinking that maybe the vocabulary just couldn't hold any more words, but even if I set it to 1000, for example, it still won't learn new words.

Answer 1

It looks like the fit_transform method freezes the vocabulary. This means that any word not observed up to that point gets ID 0 (UNK). You can unfreeze the vocabulary with new_vocabulary_processor.vocabulary_.freeze(False):

new_vocabulary_processor = learn.preprocessing.VocabularyProcessor.restore("some/path")
# Unfreeze the restored vocabulary so that fitting can assign IDs to unseen words
new_vocabulary_processor.vocabulary_.freeze(False)
new_test_sentence = "Very different uttering is this one. It is used to test."
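To make the freezing behaviour easier to see without installing an old TensorFlow release, here is a minimal pure-Python sketch. TinyVocabulary and the fit_transform helper below are hypothetical stand-ins written for this illustration; they are not the library's code, and they only imitate the behaviour described above (ID 0 for unknown words, freezing after a fit, unfreezing with freeze(False)).

```python
class TinyVocabulary:
    """Toy stand-in for the vocabulary object (hypothetical, not the library's code)."""

    def __init__(self):
        self._mapping = {"<UNK>": 0}  # ID 0 is reserved for unknown words
        self._frozen = False

    def freeze(self, freeze=True):
        # freeze(False) re-enables the assignment of new IDs
        self._frozen = freeze

    def get(self, word):
        if word not in self._mapping:
            if self._frozen:
                return 0  # frozen: unseen words collapse to the UNK ID
            self._mapping[word] = len(self._mapping)
        return self._mapping[word]


def fit_transform(vocab, sentence):
    """Map each word of one sentence to an ID, then freeze the vocabulary."""
    ids = [vocab.get(word) for word in sentence.split()]
    vocab.freeze()
    return ids


vocab = TinyVocabulary()
first = fit_transform(vocab, "this is a test")

# The vocabulary is now frozen, so unseen words become 0
frozen = fit_transform(vocab, "very different is this")

# After unfreezing, unseen words receive fresh IDs and old words keep theirs
vocab.freeze(False)
unfrozen = fit_transform(vocab, "very different is this")

print(first, frozen, unfrozen)
```

In this toy model the old words keep their IDs in every call, which matches what the unit test in the question checks for "is".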
