Python: What is the “size” parameter in Gensim Word2vec model class

Original 2017-08-01 18:12:40 4 2 python/ gensim/ word2vec

I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec.

According to the Gensim documentation, size is the dimensionality of the vector. As far as I understand, word2vec creates, for each word, a vector of the probabilities of closeness to the other words in the sentence. So if my vocabulary size is 30, how can it create a vector with more than 30 dimensions? Can anyone briefly explain what the optimal value of size is for Word2Vec?

Thank you.

2 answers

size is, as you note, the dimensionality of the vector.
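To make that concrete, here is a minimal pure-Python sketch of the kind of lookup table a word2vec-style model learns. It is an illustration, not Gensim's actual implementation, and the names `build_embedding_table`, `vocab`, and `size` are assumptions chosen for this example. Each word gets one row of `size` trainable numbers, so the vector length is set by `size` alone; the entries are learned coordinates, not one probability per vocabulary word.

```python
import random

def build_embedding_table(vocab, size, seed=0):
    """Hypothetical helper: map each word to a list of `size` small random floats."""
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-0.5, 0.5) / size for _ in range(size)]
        for word in vocab
    }

# A 30-word vocabulary, as in the question.
vocab = [f"word{i}" for i in range(30)]
table = build_embedding_table(vocab, size=5)

print(len(table))           # number of rows equals the vocabulary size: 30
print(len(table["word0"]))  # length of each row equals `size`: 5
```

Training would then adjust these numbers; the shape of the table stays the same.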

Word2Vec needs large, varied text examples to create its 'dense' embedding vectors per word. (It's the competition between many contrasting examples during training which allows the word-vectors to move to positions that have interesting distances and spatial-relationships with each other.)

If you only have a vocabulary of 30 words, word2vec is unlikely to be an appropriate technology. And if you do try to apply it, you'd want to use a vector size lower than your vocabulary size, ideally much lower. For example, texts containing many examples of each of tens-of-thousands of words might justify 100-dimensional word-vectors.

Using a higher dimensionality than vocabulary size would more-or-less guarantee 'overfitting'. The training could tend toward an idiosyncratic vector for each word – essentially like a 'one-hot' encoding – that would perform better than any other encoding, because there's no cross-word interference forced by representing a larger number of words in a smaller number of dimensions.

That'd mean a model that does about as well as possible on the Word2Vec internal nearby-word prediction task, but is then awful on other downstream tasks, because no generalizable relative-relations knowledge has been captured. (The cross-word interference is what the algorithm needs, over many training cycles, to incrementally settle into an arrangement where similar words must be similar in learned weights, and contrasting words different.)
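The one-hot extreme can be illustrated with a small pure-Python sketch. It is illustrative only: `one_hot` and `cosine` are hypothetical helpers and the four-word vocabulary is made up. When there are at least as many dimensions as words, every word can sit on its own axis, and then every pair of distinct words has cosine similarity 0, so the vectors say nothing about which words are related.

```python
import math

def one_hot(index, size):
    """Vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["cat", "dog", "car", "truck"]
# Dimensionality equal to the vocabulary size: one private axis per word.
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

print(cosine(vectors["cat"], vectors["dog"]))    # 0.0
print(cosine(vectors["cat"], vectors["truck"]))  # 0.0
```

With fewer dimensions than words, the words cannot all be mutually orthogonal, which is the "interference" that forces related words to share directions.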

size is the same setting as vector_size, the name this parameter has in newer Gensim releases. Put simply, it is the uniform dimensionality of the output vector for each word you train with word2vec.

Related questions

What is the `null_word` parameter in gensim Word2Vec?

What is the meaning of “size” of word2vec vectors [gensim library]?

Python gensim create word2vec model from vectors (in ndarray)

Layer size in gensim's word2vec

Incremental Word2Vec Model Training in gensim

Gensim Word2Vec model floating point

Gensim Word2Vec model: Cut dimensions

Class of word2vec model (Python)

Python Gensim word2vec vocabulary key

Different models with gensim Word2Vec on python

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
19 ACCEPTED 2017-08-02 06:28:45

solution2
0 2020-12-23 23:18:51
