
Why use TaggedBrownCorpus when training gensim doc2vec

I am currently using a custom corpus class that yields TaggedDocument objects:

from gensim.models.doc2vec import TaggedDocument

class ClassifyCorpus(object):
    """Yields one TaggedDocument per "id:text" line in train_data."""
    def __iter__(self):
        with open(train_data) as fp:
            for line in fp:
                # Split on the first ':' only, in case the text itself contains colons;
                # avoid shadowing the built-in id().
                doc_id, text = line.split(':', 1)
                yield TaggedDocument(text.rstrip('\n').split(), [doc_id])

Looking at the source code of TaggedBrownCorpus, I see that it just reads from a directory and handles the tagging of the documents for me.

I tested it and didn't see any improvement in training speed.

You shouldn't use TaggedBrownCorpus. It's just a demo class for reading a particular tiny demo dataset that's included with gensim for unit tests and intro tutorials.

It does things in a reasonable way for that data-format-on-disk, but any other efficient way of getting your data into a repeat-iterable sequence of TaggedDocument-like objects is just as good.

So feel free to use it as a model if it helps, but don't view it as a requirement or "best practice".
