Gensim word2vec - start vocabulary from index different than 0

I am using gensim to create word vectors based on my corpus like the following:

from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)  # gensim < 4.0; in 4.0+ `size` is named `vector_size`

Is it possible to avoid having words at indexes 0 and 1? I would like my vocabulary to start at index 2, because I need to do other operations, and keeping 0 and 1 as word indexes gets a little confusing.

Thanks for the help!

This is not a native feature of Word2Vec.

It is probably not a good idea, but you could crudely fake it by creating two dummy words with very high frequency and adding examples containing them to your training data in a way that has minimal impact on the other vectors.

For example, if the most common word in your corpus occurs 5,000 times, create a fake text containing just the words 'dummy000000000' and 'dummy000000001', each repeated 1,000 times. Add this fake text to your corpus 6 times. Each dummy word then occurs 6,000 times, so 'dummy000000000' and 'dummy000000001' will be the two most frequent words in the corpus and thus get indexes 0 and 1 (in the usual case). Training them will waste time, and the model will waste a little of its potential state giving those words crude vectors, but they should have a minimal effect on other words, since they never co-occur with real words. Voilà, you've got 0 and 1 indexes you can ignore (or treat as errors) later!
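The padding step described above could be sketched roughly as follows. This is only an illustration in plain Python: the helper name `add_dummy_texts` and its `repeats_per_text` parameter are made up for this example and are not part of gensim, and the corpus is assumed to be a list of token lists.

```python
from collections import Counter

def add_dummy_texts(sentences,
                    dummies=("dummy000000000", "dummy000000001"),
                    repeats_per_text=1000):
    """Hypothetical helper: append dummy-only texts so that each dummy word
    ends up more frequent than the most common real word."""
    counts = Counter(word for sentence in sentences for word in sentence)
    top_count = max(counts.values(), default=0)
    # Smallest number of copies that makes each dummy strictly exceed top_count
    # (for top_count=5000 and repeats_per_text=1000 this gives 6).
    copies = top_count // repeats_per_text + 1
    # One fake text holding only dummy words, so they never co-occur with real ones.
    fake_text = [d for d in dummies for _ in range(repeats_per_text)]
    return list(sentences) + [fake_text] * copies

# Toy corpus: "the" is the most common real word (3 occurrences).
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
padded = add_dummy_texts(toy_corpus, repeats_per_text=2)
padded_counts = Counter(word for sentence in padded for word in sentence)
top_two = {word for word, _ in padded_counts.most_common(2)}
```

The padded corpus would then be passed to `Word2Vec` in place of `sentences`, as in the question. Which of the two dummies ends up at index 0 and which at index 1 is not something this sketch controls; since both are meant to be ignored, that should not matter.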

But having written it out, it's pretty definitely a bad idea. It'll slow training and worsen the model slightly, and various progress/tally statistics from the model will be subtly misleading.

Also, having such indexes start at 0 is very typical professional programming practice. If you find it confusing, in general or for your specific project, that may be a habit or understanding barrier that is better to work through than to patch around with non-standard practice.
