
gensim word2vec entry greater than 1

I'm new to NLP and gensim, and I'm currently trying to solve some NLP problems with the gensim word2vec module. In my current understanding of word2vec, the resulting vectors/matrix should have all entries between -1 and 1. However, a simple trial produces a vector with entries greater than 1. I'm not sure which part is wrong; could anyone give some suggestions, please?

I've used gensim's utils.simple_preprocess to generate a list of lists of tokens. The list looks like:

[['buffer', 'overflow', 'in', 'client', 'mysql', 'cc', 'in', 'oracle', 'mysql', 'and', 'mariadb', 'before', 'allows', 'remote', 'database', 'servers', 'to', 'cause', 'denial', 'of', 'service', 'crash', 'and', 'possibly', 'execute', 'arbitrary', 'code', 'via', 'long', 'server', 'version', 'string'], ['the', 'xslt', 'component', 'in', 'apache', 'camel', 'before', 'and', 'before', 'allows', 'remote', 'attackers', 'to', 'read', 'arbitrary', 'files', 'and', 'possibly', 'have', 'other', 'unspecified', 'impact', 'via', 'an', 'xml', 'document', 'containing', 'an', 'external', 'entity', 'declaration', 'in', 'conjunction', 'with', 'an', 'entity', 'reference', 'related', 'to', 'an', 'xml', 'external', 'entity', 'xxe', 'issue']]

I believe this is the correct input format for gensim word2vec.
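For reference, here is a minimal sketch of that preprocessing step; the raw_docs variable standing in for the original description strings is hypothetical, not part of the original question:

from gensim.utils import simple_preprocess

# Hypothetical raw input: one description string per document.
raw_docs = [
    "Buffer overflow in client/mysql.cc in Oracle MySQL and MariaDB ...",
    "The XSLT component in Apache Camel ...",
]

# simple_preprocess lowercases, tokenizes, and drops punctuation and
# overly short/long tokens, yielding one list of tokens per document.
sentences = [simple_preprocess(doc) for doc in raw_docs]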

from gensim import models

# `sentences` is the list of token lists shown above
word2vec = models.word2vec.Word2Vec(sentences, size=50, window=5, min_count=1, workers=3, sg=1)
vector = word2vec['overflow']
print(vector)

I expected the output to be a vector containing probabilities (i.e., all entries between -1 and 1), but it actually turned out to be the following:

[ 0.12800379 -0.7405527  -0.85575     0.25480416 -0.2535793   0.142656
 -0.6361196  -0.13117172  1.1251501   0.5350017   0.05962601 -0.58876884
  0.02858278  0.46106443 -0.22623934  1.6473309   0.5096218  -0.06609935
 -0.70007527  1.0663376  -0.5668168   0.96070313 -1.180383   -0.58649933
 -0.09380565 -0.22683378  0.71361005  0.01779896  0.19778453  0.74370056
 -0.62354785  0.11807996 -0.54997736  0.10106519  0.23364201 -0.11299669
 -0.28960565 -0.54400533  0.10737313  0.3354464  -0.5992898   0.57183135
 -0.67273194  0.6867607   0.2173506   0.15364875  0.7696457  -0.24330224
  0.46414775  0.98163396]

You can see entries such as 1.6473309 and -1.180383 in the vector above.

It's not the case that individual word-vectors will have all their dimensions between -1.0 and 1.0.

Nor is it the case that the dimensions should be interpreted as "probabilities".

Rather, the word-vectors are learned such that the internal neural-network becomes as good as possible at predicting words from surrounding words. There's no constraint or normalization during that training forcing the individual dimensions into a restricted range, or making individual dimensions interpretable as nameable qualities.
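To see that concretely, here is a quick check, as a sketch assuming the model trained above (recent gensim exposes the full array as wv.vectors; older releases called it wv.syn0):

import numpy as np

# All word-vectors as one (vocab_size, 50) array.
vecs = word2vec.wv.vectors

# Per-dimension values routinely fall outside [-1, 1];
# nothing in training prevents that.
print(vecs.min(), vecs.max())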

It is sometimes the case that such vectors are converted, after training, into vectors of normalized unit-length, before comparison to each other. And further, when you request the cosine-similarity between two vectors, the result will always be in the range from -1.0 to 1.0. And, before doing the very-common most_similar() operation (or similar), the Word2Vec class will bulk-unit-normalize vectors & cache the results internally.
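For example, even between raw, un-normalized vectors, the cosine is bounded; a sketch using the model above ('buffer' is simply another word from the sample corpus):

import numpy as np

v1 = word2vec.wv['overflow']
v2 = word2vec.wv['buffer']

# Cosine similarity divides out the magnitudes, so the result is
# always in [-1.0, 1.0] even though the raw entries are not.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# The built-in helper computes the same quantity.
print(word2vec.wv.similarity('overflow', 'buffer'))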

But directly asking for the raw word-vector, as with model.wv['overflow'], will return the raw vector with whatever overall magnitude and per-dimension values came from training. You can request the unit-normed vector instead with:

model.wv.word_vec('overflow', use_norm=True)
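A quick way to confirm the difference, as a sketch against the same pre-4.0 gensim API (init_sims() populates the cache of unit-normed vectors that most_similar() would otherwise build on first use):

import numpy as np

word2vec.wv.init_sims()  # builds the internal unit-normed cache

raw = word2vec.wv['overflow']
normed = word2vec.wv.word_vec('overflow', use_norm=True)

print(np.linalg.norm(raw))     # arbitrary magnitude from training
print(np.linalg.norm(normed))  # ~1.0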

(Separately, be aware: testing Word2Vec on tiny toy-sized datasets will generally not yield useful or realistic results: the algorithm really requires large, varied data to come up with balanced, useful word-vectors. For example, to train up 50-dimensional vectors, I'd want at least 2,500 unique words in the vocabulary, with dozens of different uses of each word – so a corpus of many tens of thousands of words. And I might also use more than the default epochs=5, because that's still a very small corpus.)
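As a final sketch, on the pre-4.0 gensim API used in the question the epoch count is the iter parameter (gensim 4.x renamed size to vector_size and iter to epochs):

# More passes than the default 5 can help a small corpus somewhat,
# though it is no substitute for more data.
word2vec = models.word2vec.Word2Vec(
    sentences, size=50, window=5, min_count=1, workers=3, sg=1,
    iter=20,
)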
