How does doc2vec create a vector for a sentence?

I am working with Doc2Vec for text classification. It creates a vector of a given size for each sentence (e.g., a vector length of 100). I am not able to understand how it creates a vector of that length.

I am following this link. There, a vector is created for each sentence and saved in the Doc2Vec model. I can't use this model to test new data (production data), because the model holds no vector for a new sentence. This is the error shown for new data:

KeyError: "tag 'Test_2028' not seen in training corpus/invalid"

Doc2Vec concept:

The goal of doc2vec is to create a numeric representation of a document, regardless of its length. But unlike words, documents do not come in logical structures such as words, so another method has to be found.

The concept that Mikolov and Le used was simple, yet clever: they took the word2vec model and added another vector, paragraph_ID, which is unique to each document. Now, instead of using just words to predict the next word, the model also uses this additional feature vector.

So, when training the word vectors W, the document vector paragraph_ID is trained as well, and at the end of training it holds a numeric representation of the document.
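To make the idea above concrete, here is a minimal pure-Python sketch of that training step, assuming the "distributed memory" variant with averaged inputs and a full softmax output. All names (vector_size, doc_vecs, train_step, and so on) are hypothetical and chosen for illustration. Real libraries such as gensim use faster approximations, so treat this as a toy, not as their implementation.

```python
import math
import random

random.seed(0)
vector_size = 4  # the length you choose (100 in the question)
vocab = ["the", "cat", "sat", "on", "mat"]
word_index = {w: i for i, w in enumerate(vocab)}

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# Trainable parameters: one vector per word, one per document tag,
# and an output layer that scores every vocabulary word.
word_vecs = [rand_vec(vector_size) for _ in vocab]
doc_vecs = {"doc_0": rand_vec(vector_size)}
out_weights = [rand_vec(vector_size) for _ in vocab]

def predict(doc_tag, context_words):
    # Hidden layer: average the document vector with the context word vectors.
    inputs = [doc_vecs[doc_tag]] + [word_vecs[word_index[w]] for w in context_words]
    hidden = [sum(col) / len(inputs) for col in zip(*inputs)]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in out_weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

def train_step(doc_tag, context_words, target, lr=0.1):
    hidden, probs = predict(doc_tag, context_words)
    t = word_index[target]
    # Gradient of the cross-entropy loss w.r.t. the scores: probs - one_hot.
    errs = [p - (1.0 if i == t else 0.0) for i, p in enumerate(probs)]
    grad_hidden = [
        sum(errs[i] * out_weights[i][d] for i in range(len(vocab)))
        for d in range(vector_size)
    ]
    for i in range(len(vocab)):
        for d in range(vector_size):
            out_weights[i][d] -= lr * errs[i] * hidden[d]
    # The document vector is updated together with the word vectors.
    n_inputs = 1 + len(context_words)
    for d in range(vector_size):
        step = lr * grad_hidden[d] / n_inputs
        doc_vecs[doc_tag][d] -= step
        for w in context_words:
            word_vecs[word_index[w]][d] -= step
    return -math.log(probs[t])  # loss measured before this update

# Repeatedly predict "sat" from the document vector plus the context "the cat".
losses = [train_step("doc_0", ["the", "cat"], "sat") for _ in range(50)]
print(len(doc_vecs["doc_0"]), losses[0] > losses[-1])
```

The document vector always has vector_size entries because it is allocated at that length before training starts. The length is a setting you pick, not something derived from the number of words in the sentence.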

You can read more about it here

If you've created a gensim Doc2Vec model with your training data, it will only know trained vectors for the document tags that were present in the training data.

However, there's also the method infer_vector(), which can infer a compatible document vector for a new text. The new text should be tokenized the same way as the training data and passed to infer_vector() as a list of string tokens.
