简体   繁体   中英

Doc2vec: Only 10 docvecs in gensim doc2vec model?

I used gensim fit a doc2vec model, with tagged document (length>10) as training data. The target is to get doc vectors of all training docs, but only 10 vectors can be found in model.docvecs.

The example of training data (length>10)

docs = ['This is a sentence', 'This is another sentence', ....]

with some pre-treatment

doc_=[d.strip().split(" ") for d in doc]
doc_tagged = []
for i in range(len(doc_)):
  tagd = TaggedDocument(doc_[i],str(i))
  doc_tagged.append(tagd)

tagged docs

TaggedDocument(words=array(['a', 'b', 'c', ..., ],
  dtype='<U32'), tags='117')

fit a doc2vec model

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(doc_tagged)
model.train(doc_tagged, total_examples= model.corpus_count, epochs= model.iter)

then i get the final model

len(model.docvecs)

the result is 10...

I tried other datasets (length>100, 1000) and got same result of len(model.docvecs) . So, my question is: How to use model.docvecs to get full vectors? (without using model.infer_vector ) Is model.docvecs designed to provide all training docvecs?

The bug is in this line:

tagd = TaggedDocument(doc[i],str(i))

Gensim's TaggedDocument accepts a sequence of tags as a second argument. When you pass a string '123' , it's turned into ['1', '2', '3'] , because it's treated as a sequence . As a result, all of the documents are tagged with just 10 tags ['0', ..., '9'] , in various combinations.

Another issue: you're defining doc_ and never actually using it, so your documents will be split incorrectly as well.

Here's the proper solution:

docs = [doc.strip().split(' ') for doc in docs]
tagged_docs = [doc2vec.TaggedDocument(doc, [str(i)]) for i, doc in enumerate(docs)]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM