
Gensim doc2vec training on ngrams

I have several thousand documents that I'd like to use in a gensim doc2vec model, but I only have 5grams for each of the documents, not the full texts in their original word order. In the doc2vec tutorial on the gensim website ( https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html ), a corpus is created with full texts and then the model is trained on that corpus. It looks something like this:

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern',...], tags=[1]), TaggedDocument(words=[.....], tags=[2]),...]

Is it possible to create a training corpus where each document consists of a list of 5-grams rather than a list of words in their original order?

If you have "all" the 5-grams from the documents – perhaps even still in the order they appeared – it should be possible to stitch together the original documents (or near-equivalent pseudo-documents), as if the 5-grams were puzzle pieces or dominoes.

(For example, find the first 5-gram, either by its ordinal position in your data or by finding a 5-gram whose 4 prefix tokens don't match any other 5-gram's 4 suffix tokens. Then find its successor by matching its 4 suffix tokens to the 4 prefix tokens of another candidate 5-gram. If at any point you have more than one candidate 'start' or 'continuation', you can try any one and keep going until you either finish or reach a dead end – a depth-first search for consistent chains – and on a dead end, back up and try another. Alternatively, you could probably just pick another good start 5-gram and continue, at the risk of slightly misordering the document and repeating a few tokens. A scattering of such errors won't have much effect on final results in a large corpus.)
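The chaining described above could be sketched as follows – a minimal, hypothetical implementation that assumes each 5-gram is a list of tokens and that no 5-gram appears twice in a document:

```python
from collections import defaultdict

def stitch(fivegrams):
    """Chain 5-grams into one token sequence by matching each 5-gram's
    4-token suffix to another's 4-token prefix, using depth-first search
    with backtracking. Assumes no duplicate 5-grams (a sketch, not
    production code). Falls back to plain concatenation if no full
    chain is found."""
    grams = [tuple(g) for g in fivegrams]
    by_prefix = defaultdict(list)
    for g in grams:
        by_prefix[g[:4]].append(g)
    suffixes = {g[1:] for g in grams}
    # a 'start' 5-gram is one whose 4-token prefix is no other's 4-token suffix
    starts = [g for g in grams if g[:4] not in suffixes] or grams[:1]

    def dfs(chain, used):
        if len(used) == len(grams):
            return chain
        for nxt in by_prefix.get(chain[-1][1:], []):
            if nxt not in used:
                result = dfs(chain + [nxt], used | {nxt})
                if result:
                    return result
        return None  # dead end: back up and try another continuation

    for s in starts:
        result = dfs([s], {s})
        if result:
            # the first 5-gram contributes all 5 tokens;
            # each successor adds only its final token
            return list(result[0]) + [g[-1] for g in result[1:]]
    return [tok for g in grams for tok in g]  # fallback: just concatenate
```

Reconstructed token lists like these can then be wrapped in `TaggedDocument`s exactly as in the tutorial.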

Alternatively, the 'PV-DBOW' mode ( dm=0 ) doesn't use context-windows or neighboring words – so recovering the exact original word order doesn't matter, just stand-in documents with the right words in any order. Simply concatenating all the 5-grams creates a reasonable pseudo-document – especially if you then discard about 4 of every 5 tokens (to account for the fact that each word in the original doc, except near the very beginning or end, appears in 5 overlapping 5-grams).
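A minimal sketch of that concatenate-and-downsample approach, assuming each document arrives as a list of 5-gram token-lists (the random 1-in-5 keep rate is one simple way to correct for the 5x over-counting; the function name and seed parameter are illustrative, not from any library):

```python
import random

def pseudo_doc(fivegrams, keep=0.2, seed=0):
    """Concatenate a document's 5-grams, then keep roughly 1 in 5 tokens,
    since each interior word of the original appears in 5 overlapping
    5-grams. Word order in the result is arbitrary, which is fine for
    PV-DBOW (dm=0) training."""
    rng = random.Random(seed)
    return [tok for gram in fivegrams for tok in gram if rng.random() < keep]
```

The resulting token lists could then be fed to gensim as usual, e.g. `TaggedDocument(words=pseudo_doc(grams), tags=[i])` for each document, and trained with `Doc2Vec(corpus, dm=0, ...)`.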
