
Why use TaggedBrownCorpus when training gensim doc2vec

I am currently using a custom corpus class that yields TaggedDocument objects:

from gensim.models.doc2vec import TaggedDocument

class ClassifyCorpus(object):
    """Yields one TaggedDocument per "id:text" line in train_data."""
    def __iter__(self):
        with open(train_data) as fp:
            for line in fp:
                # Split on the first ':' only, in case the text itself contains colons;
                # avoid shadowing the built-in id().
                doc_id, text = line.split(':', 1)
                yield TaggedDocument(text.rstrip('\n').split(), [doc_id])

Looking at the source code of TaggedBrownCorpus, I see that it just reads from a directory and handles the tagging of the documents for me.

I tested it and didn't see any improvement in training speed.

You shouldn't use TaggedBrownCorpus. It's just a demo class for reading a particular tiny demo dataset that's included with gensim for unit tests and intro tutorials.

It does things in a reasonable way for that data-format-on-disk, but any other efficient way of getting your data into a repeat-iterable sequence of TaggedDocument-like objects is just as good.

So feel free to use it as a model if it helps, but don't view it as a requirement or "best practice".
