
Manage KeyError with gensim and pretrained word2vec model

I pretrained a word embedding using wang2vec (https://github.com/wlin12/wang2vec) and loaded it in Python through gensim. When I tried to get the vector of a word that is not in the vocabulary, I got, as expected:

KeyError: "word 'kjklk' not in vocabulary"

So I thought about adding an item to the vocabulary to which OOV (out-of-vocabulary) words could be mapped, let's say <OOV>. Since the vocabulary is a dict, I would simply add the item {"<OOV>": 0}.

But I inspected an item of the vocabulary with:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(w2v_ext, binary=False, unicode_errors='ignore')
dict(list(model.vocab.items())[5:6])

The output was something like:

{'word': <gensim.models.keyedvectors.Vocab at 0x7fc5aa6007b8>}

So, is there a way to add the <OOV> token to the vocabulary of a pretrained word embedding loaded through gensim, and avoid the KeyError? I looked at the gensim docs and found this: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.build_vocab but it does not seem to work with the update parameter.

Adding a synthetic '<OOV>' token would just let you look up that token, as in model['<OOV>']. The model would still raise a KeyError for absent keys like 'kjklk'.

There's no built-in support for adding any such 'catch-all' mapping. Often, ignoring unknown tokens is better than using some plug value (such as a zero vector or a random vector).
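As a minimal sketch of the "ignore unknown tokens" approach, the snippet below averages only the vectors of words that are present. The plain dict `toy_vectors` is an assumed stand-in for the loaded model (both support `in` and `[]` lookups), and `average_known_vectors` is a hypothetical helper name, not part of gensim.

```python
def average_known_vectors(tokens, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    # Unknown tokens are simply skipped instead of being mapped to a plug value.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy stand-in for the model (assumption for illustration only).
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

result = average_known_vectors(["cat", "kjklk", "dog"], toy_vectors)
print(result)
```

Returning None when every token is unknown leaves it to the caller to decide what an "empty" text should mean, rather than silently inventing a vector.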

It's fairly idiomatic in Python to explicitly check whether a key is present, via the in keyword, if you want to do something different for absent keys. For example:

vector = model['kjklk'] if 'kjklk' in model else DEFAULT_VECTOR

(Notably, the conditional expression expr1 if expr2 else expr3 evaluates expr2 first and evaluates expr1 only when expr2 is true, so the lookup never runs for an absent key and no KeyError is raised.)
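If the same check is needed in many places, it could be wrapped in a small helper. This is only a sketch: `vector_or_default` is a hypothetical name, and the dict `toy_vectors` again stands in for the loaded model, which is an assumption made so the example runs without gensim.

```python
def vector_or_default(word, vectors, default):
    """Return vectors[word] if the word is known, otherwise the given default."""
    # vectors[word] is evaluated only when the membership test succeeds,
    # so an absent word never triggers a KeyError here.
    return vectors[word] if word in vectors else default


# Toy stand-in for the model (assumption for illustration only).
toy_vectors = {"cat": [1.0, 0.0]}
DEFAULT_VECTOR = [0.0, 0.0]

known = vector_or_default("cat", toy_vectors, DEFAULT_VECTOR)
unknown = vector_or_default("kjklk", toy_vectors, DEFAULT_VECTOR)
print(known, unknown)
```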

Python also has the defaultdict dictionary variant, which can return a default value for any unknown key. See:

https://docs.python.org/3.7/library/collections.html#collections.defaultdict

It would be possible to try replacing the KeyedVectors vocab dictionary with one of those, if the behavior is really important, but there could be side effects on other code.
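To illustrate one such side effect in plain Python (this does not touch gensim, and whether gensim itself tolerates a replaced vocab dictionary is not shown here): looking up a missing key in a defaultdict silently inserts that key, so the dictionary grows with every unknown word. The names and values below are made up for the example.

```python
from collections import defaultdict

DEFAULT_VECTOR = (0.0, 0.0)

# A defaultdict seeded with one known word; unknown keys get DEFAULT_VECTOR.
vocab = defaultdict(lambda: DEFAULT_VECTOR, {"cat": (1.0, 0.0)})

present_before = "kjklk" in vocab   # a membership test does not create the key
size_before = len(vocab)

fallback = vocab["kjklk"]           # returns the default AND stores the key

present_after = "kjklk" in vocab
size_after = len(vocab)

print(present_before, size_before, fallback, present_after, size_after)
```

Any code that relies on the vocabulary size, or on membership tests to detect real words, would see the junk key after the first failed lookup.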
