
Is there a pretrained Gensim phrase model?

Is there a pretrained Gensim Phrases model? If not, would it be possible to reverse-engineer and create a phrase model from a pretrained word embedding?

I am trying to use GoogleNews-vectors-negative300.bin with Gensim's Word2Vec. First, I need to map my words into phrases so that I can look up their vectors in Google's pretrained embedding.

I searched the official Gensim documentation but could not find any info. Thanks!

I'm not aware of anyone sharing a Phrases model. Any such model would be very sensitive to the preprocessing/tokenization step, and to the specific parameters, that its creator used.
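If you have a corpus of your own, training a Phrases model is straightforward. A minimal sketch, assuming gensim 4.x; the toy sentences and the min_count/threshold values are placeholders, not recommended settings:

```python
from gensim.models.phrases import Phrases

# Toy corpus: lists of already-tokenized sentences (placeholder data).
sentences = [
    ['new', 'york', 'is', 'a', 'big', 'city'],
    ['i', 'love', 'new', 'york'],
    ['new', 'york', 'pizza', 'is', 'great'],
]

# Learn which bigrams co-occur often enough to count as phrases.
# min_count/threshold are illustrative; tune them for a real corpus.
phrases = Phrases(sentences, min_count=1, threshold=0.1)
bigram = phrases.freeze()  # smaller, read-only version for applying

print(bigram[['i', 'love', 'new', 'york']])
# e.g. ['i', 'love', 'new_york'] if the bigram scored above threshold
```

Whatever tokenization and parameters you pick there become baked into the model, which is exactly why a model trained on someone else's corpus and preprocessing rarely transfers.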

Other than the high-level algorithm description, I haven't seen Google's exact tokenization/canonicalization/phrase-combination choices for the data that fed the GoogleNews 2013 word-vectors documented anywhere. Some guesses about the preprocessing can be made by reviewing the tokens present, but I'm unaware of any code that applies similar choices to other text.

You could try to mimic their unigram tokenization, then speculatively combine runs of unigrams into ever-longer multigrams up to some maximum, check whether those combinations are present in the vocabulary, and, when they are not, fall back to the unigrams (or to the largest combination present); see the sketch below. This might be expensive if done naively, but it is amenable to optimization if really important, especially for some subset of the more-frequent words, since the GoogleNews set appears to obey the convention of listing words in descending frequency.
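A minimal sketch of that greedy longest-match idea, assuming gensim 4.x and that multigrams in the GoogleNews vocabulary are underscore-joined (as tokens like New_York suggest); max_len and the example sentence are my own placeholders:

```python
from gensim.models import KeyedVectors

# Load the pretrained GoogleNews vectors (file name from the question).
kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

def combine_tokens(tokens, kv, max_len=4):
    """Greedily replace each run of unigrams with the longest
    '_'-joined multigram found in the vocabulary, falling back to
    shorter combinations and finally the bare unigram."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking toward the unigram.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = '_'.join(tokens[i:i + n])
            if n == 1 or candidate in kv:
                out.append(candidate)
                i += n
                break
    return out

print(combine_tokens(['I', 'visited', 'New', 'York', 'City'], kv))
# plausibly ['I', 'visited', 'New_York_City'], if that multigram
# is present in the GoogleNews vocabulary
```

To exploit the descending-frequency ordering mentioned above, you could pass limit=500000 (or similar) to load_word2vec_format so that only the most-frequent subset of the vocabulary is loaded and checked.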

(In general, though it's a quick & easy starting set of word-vectors, I think GoogleNews is a bit over-relied upon. It will lack words/phrases and new senses that have developed since 2013, and any meanings it does capture were determined by news articles in the years leading up to 2013... which may not match the dominant senses of words in other domains. If your domain isn't specifically news, and you have sufficient data, deciding on your own domain-specific tokenization/combination will likely perform better.)
