How to obtain document vectors in doc2vec in gensim

I know how to obtain the document vector for a given tag in doc2vec, using print(model.docvecs['recipe__11']).

My documents are either recipes (tags start with recipe__), newspapers (tags start with news__), or ingredients (tags start with ingre__).

Now I want to retrieve the document vectors of all recipes. The tags of my recipe documents follow the pattern recipe__<some number> (e.g., recipe__23, recipe__34). Is it possible to obtain multiple document vectors using a pattern (e.g., all tags starting with recipe__)?

Please help me!

There's no built-in pattern-based retrieval, but you can access the list of all known (string) doc-tags in model.docvecs.offset2doctag. You could then loop over that list to find all matching tags, and retrieve each vector individually.
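The loop described above could be sketched as follows. This is only an illustration: the helper name vectors_by_prefix and the FakeDocvecs stand-in class are hypothetical, introduced so the snippet runs without a trained model. It assumes only that the object exposes offset2doctag and supports lookup by tag, as model.docvecs does in the question.

```python
def vectors_by_prefix(docvecs, prefix):
    """Return {tag: vector} for every string doc-tag starting with `prefix`."""
    return {
        tag: docvecs[tag]               # per-tag lookup, as in model.docvecs['recipe__11']
        for tag in docvecs.offset2doctag
        if tag.startswith(prefix)
    }


class FakeDocvecs:
    """Hypothetical stand-in for model.docvecs, used only for this demo."""

    def __init__(self, vectors_by_tag):
        self._vectors = dict(vectors_by_tag)
        self.offset2doctag = list(self._vectors)  # tags in insertion order

    def __getitem__(self, tag):
        return self._vectors[tag]


demo = FakeDocvecs({
    'recipe__11': [0.1, 0.2],
    'news__3': [0.3, 0.4],
    'recipe__23': [0.5, 0.6],
    'ingre__7': [0.7, 0.8],
})

recipe_vectors = vectors_by_prefix(demo, 'recipe__')
print(sorted(recipe_vectors))
```

With a trained model, the call would presumably be vectors_by_prefix(model.docvecs, 'recipe__'). The attribute names are the ones used in this answer and may differ between gensim versions.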

Also, all the doc-vectors are stored in one large array, model.docvecs.doctag_syn0. If you've used exclusively string doc-tags, then the position of a tag in offset2doctag will be exactly the index of the corresponding vector in doctag_syn0. That would allow you to use numpy 'mask indexing' to grab a subset of vectors as a new array, like:

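import numpy as np
# Boolean mask: True at each position whose tag belongs to a recipe document
recipes_mask = np.array([tag.startswith('recipe__') for tag in model.docvecs.offset2doctag])
recipes_vectors = model.docvecs.doctag_syn0[recipes_mask]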

Of course, this array-of-vectors no longer has the recipes in the same positions as the original, so you'd need extra steps to know where (for example) the 'recipe__11' vector is in recipes_vectors.
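One possible form of those extra steps, as a rough sketch: boolean mask indexing keeps the selected rows in their original relative order, so the n-th matching tag ends up in row n of the subset. The helper name build_row_lookup and the sample tag list are hypothetical, used only for illustration.

```python
def build_row_lookup(all_tags, prefix):
    """Map each tag starting with `prefix` to its row in the mask-selected array."""
    # Same filter as the mask, so the order matches the rows of the subset array.
    matching = [tag for tag in all_tags if tag.startswith(prefix)]
    return {tag: row for row, tag in enumerate(matching)}


# Hypothetical tag list standing in for model.docvecs.offset2doctag
all_tags = ['recipe__11', 'news__3', 'recipe__23', 'ingre__7', 'recipe__34']

row_of = build_row_lookup(all_tags, 'recipe__')
print(row_of)
```

With the real model, row_of = build_row_lookup(model.docvecs.offset2doctag, 'recipe__') followed by recipes_vectors[row_of['recipe__11']] should give back the same vector as model.docvecs['recipe__11'], provided only string doc-tags were used.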
