[Word2Vec][gensim] Handling words missing from the vocabulary with the min_count parameter

Question

Some similar questions have already been asked on this topic, but I am not satisfied with the answers so far; please excuse me in advance.

I am using the Word2Vec function from the Python library gensim.

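My problem is that as soon as I set the parameter min_count to a value greater than 1, I can no longer run my model on every word of the corpus. Some would say this is logical, because I chose to ignore words that appear only once. But the function behaves strangely: it raises an error saying the word 'blabla' is not in the vocabulary, whereas that is exactly what I wanted (I want this word to be out of the vocabulary).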

I can imagine this is not very clear, so here is a reproducible example:

import gensim
from gensim.models import Word2Vec

# My corpus
corpus=[["paris","not","great","city"],
       ["praha","better","great","than","paris"],
       ["praha","not","country"]]

# Load a pre-trained model - the original one based on Google News
model_google = gensim.models.KeyedVectors.load_word2vec_format(r'GoogleNews-vectors-negative300.bin', binary=True)

# Initializing our model and upgrading it with Google's 
my_model = Word2Vec(size=300, min_count=2)  # with min_count=1, everything works fine
my_model.build_vocab(corpus)
total_examples = my_model.corpus_count
my_model.build_vocab([list(model_google.vocab.keys())], update=True)
my_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, lockf=1.0)
my_model.train(corpus, total_examples=total_examples, epochs=my_model.iter)

# Show examples
print(my_model['paris'][0:10])  # works because 'paris' appears twice
print(my_model['country'][0:10])  # fails because 'country' appears only once

You can find Google's model there, for example, but feel free to use any model, or none at all; that is not the point of my post.

As noted in the code comments: running the model on 'paris' works, but not on 'country'. Of course, if I set the parameter min_count to 1, everything works fine.

I hope that is clear enough.

Thanks.

Answer 1

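It is supposed to raise an error if you ask for a word that is not present, because you have chosen not to learn vectors for rare words, like 'country' in your example. (And: words with so few examples usually don't get good vectors, and retaining them can worsen the vectors for the remaining words, so a min_count as large as you can manage, and perhaps much larger than 1, is usually a good idea.)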
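
To see which words survive the cut, here is a minimal pure-Python sketch of the counting rule. It is not gensim's code; surviving_vocab is a hypothetical helper name, and the only assumption is that a word is kept when it occurs at least min_count times across the whole corpus.

```python
from collections import Counter

# The corpus from the question.
corpus = [
    ["paris", "not", "great", "city"],
    ["praha", "better", "great", "than", "paris"],
    ["praha", "not", "country"],
]


def surviving_vocab(sentences, min_count):
    """Return the set of words occurring at least `min_count` times.

    Hypothetical helper that only mimics the frequency cutoff;
    it is not part of gensim.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


vocab = surviving_vocab(corpus, min_count=2)
print(sorted(vocab))       # ['great', 'not', 'paris', 'praha']
print("country" in vocab)  # False
```

With min_count=2 only four of the eight distinct words are kept, which is why a lookup of 'country' has nothing to return.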

The fix is to do one of the following:

Don't ask for words that aren't present. Check first, via something like Python's in operator. For example:

if 'country' in my_model:
    print(my_model['country'][0:10])
else: 
    pass  # do nothing, since `min_count=2` means there's no 'country' vector

Catch the error, falling back to whatever you want to happen for absent words:

try:
    print(my_model['country'][0:10])
except KeyError:
    pass  # do nothing, or perhaps print an error, whatever

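Change to using a model that always returns something for any word, like FastText, which will try to synthesize vectors for unknown words using the subwords learned during training. (The result might be garbage, or it might be pretty good if the unknown word is highly similar to known words in characters and meaning, but for some uses it is better than nothing.)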
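
If you do a lot of lookups, the first two options can be wrapped in a small helper. This is only a sketch: vector_or_default is a hypothetical name, and a plain dict stands in for the trained model so the snippet runs without gensim. It assumes the model object supports [] and raises KeyError for an absent word, as in the try/except snippet above.

```python
def vector_or_default(vectors, word, default=None):
    """Return vectors[word], or `default` when the word has no vector.

    `vectors` can be any object that supports [] and raises KeyError
    for missing keys (assumption: the trained model behaves this way).
    """
    try:
        return vectors[word]
    except KeyError:
        return default


# Stand-in for a trained model: only the frequent words have vectors.
toy_vectors = {"paris": [0.1, 0.2, 0.3], "praha": [0.4, 0.5, 0.6]}

print(vector_or_default(toy_vectors, "paris"))    # [0.1, 0.2, 0.3]
print(vector_or_default(toy_vectors, "country"))  # None
```

Returning None (or a zero vector, if your downstream code prefers that) keeps the decision about absent words in one place instead of scattering checks around.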
