
Gensim word2vec - start vocabulary from an index other than 0

I am using gensim to create word vectors from my corpus, like this:

from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

Is it possible to start the vocabulary at an index other than 0, or otherwise avoid having words at indexes 0 and 1? I would like my vocabulary to start at index 2, because I need to do other operations, and having words at indexes 0 and 1 makes that a little confusing.

Thanks for the help!

It's not a native feature of Word2Vec.

This is probably not a good idea, but you could crudely fake it by creating two very high-frequency dummy words and adding examples containing them to your training data in a way that has minimal impact on the other vectors.

For example, if the most common word in your corpus occurs 5,000 times, create a fake text containing only the words 'dummy000000000' and 'dummy000000001', repeated 1,000 times each. Add this fake text to your corpus 6 times. Each dummy word then occurs 6,000 times, so 'dummy000000000' and 'dummy000000001' will be the two most frequent words in the corpus, and thus get indexes 0 and 1 (in the usual case). Their training will waste time, and the model will waste a little of its potential state giving those words crude vectors, but they should have a minimal effect on other words (since they never co-occur with real words). Voila, you've got indexes 0 and 1 that you can ignore (or treat as errors) later!
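The arithmetic above can be sketched in plain Python. This is only an illustration under stated assumptions: `make_dummy_texts` is a hypothetical helper, the toy corpus is made up, and the vocabulary order is emulated with `collections.Counter` sorted by descending frequency instead of being read from a trained gensim model.

```python
from collections import Counter

DUMMIES = ["dummy000000000", "dummy000000001"]

def make_dummy_texts(sentences, dummies=DUMMIES, repeats=1000):
    """Hypothetical helper: build dummy-only texts so that each dummy
    word ends up more frequent than the most common real word."""
    counts = Counter(word for sentence in sentences for word in sentence)
    top_count = max(counts.values())
    # Each fake text holds every dummy word `repeats` times and no real words.
    fake_text = [d for d in dummies for _ in range(repeats)]
    # Enough copies that repeats * copies exceeds the top real count.
    copies = top_count // repeats + 1
    return [list(fake_text) for _ in range(copies)]

# Made-up toy corpus: the most common real words occur 5,000 times.
real_sentences = [["the", "cat"]] * 5000 + [["a", "dog"]] * 3000

dummy_texts = make_dummy_texts(real_sentences)
corpus = real_sentences + dummy_texts

# Emulate a descending-frequency vocabulary order (an assumption here,
# not a call into gensim).
counts = Counter(word for sentence in corpus for word in sentence)
index_order = [word for word, _ in counts.most_common()]
print(len(dummy_texts), index_order[:2])
```

With a top real count of 5,000 and 1,000 repeats per text, the helper produces 6 copies, matching the numbers in the answer. The combined `corpus` would then be passed to `Word2Vec` in place of `sentences`.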

But having written it out, it's pretty definitely a bad idea. It'll slow and worsen the model slightly. Various progress/tally statistics from the model will be subtly misleading.

And having such indexes start at 0 is very typical professional programming practice. If you find it confusing, in general or for your specific project, that may be a habit/understanding barrier that it's better to work through than to patch around with non-standard practice.
