I'm new to NLP and gensim, and I'm currently trying to solve some NLP problems with gensim's word2vec module. In my current understanding of word2vec, the resulting vectors/matrix should have all entries between -1 and 1. However, a simple trial produces a vector with entries greater than 1. I'm not sure which part is wrong; could anyone give some suggestions, please?
I've used gensim's utils.simple_preprocess to generate a list of lists of tokens. The list looks like:
[['buffer', 'overflow', 'in', 'client', 'mysql', 'cc', 'in', 'oracle', 'mysql', 'and', 'mariadb', 'before', 'allows', 'remote', 'database', 'servers', 'to', 'cause', 'denial', 'of', 'service', 'crash', 'and', 'possibly', 'execute', 'arbitrary', 'code', 'via', 'long', 'server', 'version', 'string'], ['the', 'xslt', 'component', 'in', 'apache', 'camel', 'before', 'and', 'before', 'allows', 'remote', 'attackers', 'to', 'read', 'arbitrary', 'files', 'and', 'possibly', 'have', 'other', 'unspecified', 'impact', 'via', 'an', 'xml', 'document', 'containing', 'an', 'external', 'entity', 'declaration', 'in', 'conjunction', 'with', 'an', 'entity', 'reference', 'related', 'to', 'an', 'xml', 'external', 'entity', 'xxe', 'issue']]
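For context, the preprocessing step looked roughly like this (a minimal sketch; raw_texts here is just a stand-in for my actual list of description strings):

from gensim import utils

# Stand-in for the real corpus of vulnerability-description strings
raw_texts = [
    "Buffer overflow in client/mysql.cc in Oracle MySQL and MariaDB ...",
    "The XSLT component in Apache Camel ...",
]

# simple_preprocess lowercases, tokenizes, and drops very short/long tokens,
# yielding the list-of-lists-of-tokens format shown above
sentences = [utils.simple_preprocess(text) for text in raw_texts]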
I believe this is the correct input format for gensim word2vec.
from gensim import models

# Train a skip-gram (sg=1) model with 50-dimensional vectors
word2vec = models.word2vec.Word2Vec(sentences, size=50, window=5, min_count=1, workers=3, sg=1)
vector = word2vec['overflow']  # same as word2vec.wv['overflow']
print(vector)
I expected the output to be a vector containing probabilities (i.e., all entries between -1 and 1), but it actually turned out to be the following:
[ 0.12800379 -0.7405527 -0.85575 0.25480416 -0.2535793 0.142656
-0.6361196 -0.13117172 1.1251501 0.5350017 0.05962601 -0.58876884
0.02858278 0.46106443 -0.22623934 1.6473309 0.5096218 -0.06609935
-0.70007527 1.0663376 -0.5668168 0.96070313 -1.180383 -0.58649933
-0.09380565 -0.22683378 0.71361005 0.01779896 0.19778453 0.74370056
-0.62354785 0.11807996 -0.54997736 0.10106519 0.23364201 -0.11299669
-0.28960565 -0.54400533 0.10737313 0.3354464 -0.5992898 0.57183135
-0.67273194 0.6867607 0.2173506 0.15364875 0.7696457 -0.24330224
0.46414775 0.98163396]
You can see there are values such as 1.6473309 and -1.180383 in the above vector.
It's not the case that individual word-vectors will have all their individual dimensions between -1.0 and 1.0.
Nor is it the case that the dimensions should be interpreted as "probabilities".
Rather, the word-vectors are learned such that the internal neural-network becomes as good as possible at predicting words from surrounding words. There's no constraint or normalization during that training forcing the individual dimensions into a restricted range, or making individual dimensions interpretable as nameable qualities.
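You can verify this directly on a trained model. A quick check (a sketch against the gensim-3.x API, where the raw array is model.wv.vectors; older versions called it syn0):

import numpy as np

vecs = word2vec.wv.vectors  # raw (vocab_size, 50) array straight from training
print(vecs.min(), vecs.max())            # dimensions routinely fall outside [-1.0, 1.0]
print(np.linalg.norm(vecs, axis=1)[:5])  # per-word magnitudes vary; they're not all 1.0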
It is sometimes the case that such vectors are converted, after training, into vectors of normalized unit-length, before comparison to each other. Further, when you request the cosine-similarity between two vectors, the result will always be in the range from -1.0 to 1.0. And, before doing the very-common most_similar() operation (or similar), the Word2Vec class will bulk-unit-normalize vectors & cache the results internally.
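For example (using the model variable from the question), similarity queries stay bounded even though the raw vectors aren't:

print(word2vec.wv.similarity('overflow', 'buffer'))  # cosine-similarity, always in [-1.0, 1.0]
print(word2vec.wv.most_similar('overflow', topn=5))  # uses the cached unit-normalized vectors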
But directly asking for the raw word-vector, as per model.wv['overflow'], will return the raw vector with whatever overall magnitude, and per-dimension values, came from training. You can request the unit-normed vector instead with:

model.wv.word_vec('overflow', use_norm=True)
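A quick comparison of the two, again against the gensim-3.x API (in gensim 4.0+ this call became get_vector('overflow', norm=True)):

import numpy as np

word2vec.wv.init_sims()  # build & cache the unit-normalized copies (most_similar() also does this lazily)

raw = word2vec.wv.word_vec('overflow')                     # raw training-magnitude vector
normed = word2vec.wv.word_vec('overflow', use_norm=True)   # unit-length version

print(np.linalg.norm(raw))     # arbitrary magnitude left over from training
print(np.linalg.norm(normed))  # ~1.0, so every dimension now lies within [-1.0, 1.0]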
(Separately, be aware: testing Word2Vec on tiny toy-sized datasets will generally not yield useful or realistic results: the algorithm really requires large, varied data to come up with balanced, useful word-vectors. For example, to train up 50-dimensional vectors, I'd want at least 2,500 unique words in the vocabulary, with dozens of different uses of each word – so a corpus of many tens of thousands of words. And I might also use more than the default epochs=5, because that's still a very small corpus.)
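If you do stick with a small corpus, extra training passes are set via the iter parameter in the gensim-3.x constructor (renamed epochs in 4.0+), e.g.:

word2vec = models.word2vec.Word2Vec(sentences, size=50, window=5, min_count=1, workers=3, sg=1, iter=20)

though more and varied data will help far more than extra epochs.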