I am currently using a custom corpus that yields TaggedDocument objects:
    from gensim.models.doc2vec import TaggedDocument

    class ClassifyCorpus(object):
        def __iter__(self):
            with open(train_data) as fp:
                for line in fp:
                    splt = line.split(':')
                    doc_id = splt[0]
                    text = splt[1].replace('\n', '')
                    yield TaggedDocument(text.split(), [doc_id])
Looking at the source code of TaggedBrownCorpus, I see that it just reads from a directory and handles the tagging of the documents for me.
I tested it and didn't see improvements in the training speed.
You shouldn't use TaggedBrownCorpus. It's just a demo class for reading a particular tiny demo dataset that's included with gensim for unit tests and intro tutorials.
It does things in a reasonable way for that data format on disk, but any other efficient way of getting your data into a repeat-iterable sequence of TaggedDocument-like objects is just as good.
So feel free to use it as a model if it helps, but don't view it as a requirement or "best practice".
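For illustration, a "repeat-iterable sequence of TaggedDocument-like objects" doesn't even require gensim's helper classes: Doc2Vec only needs each item to expose `words` and `tags`, and the corpus must survive being iterated multiple times (once for the vocabulary scan, then once per training epoch). Here is a minimal sketch under the assumptions from the question (a colon-separated `id:text` file; the `TaggedDoc` namedtuple stands in for gensim's TaggedDocument):

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument: only `words` and `tags` are required.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

class ClassifyCorpus:
    """Repeat-iterable corpus over an `id:text` file (hypothetical format)."""

    def __init__(self, path):
        # Store the path, not an open file handle, so that every call to
        # __iter__ reopens the file and the corpus can be iterated repeatedly.
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as fp:
            for line in fp:
                # partition() splits on the first ':' only, so colons inside
                # the document text don't truncate it.
                doc_id, _, text = line.partition(":")
                yield TaggedDoc(text.split(), [doc_id])
```

Because `__iter__` reopens the file each time, an instance of this class can be passed directly as the `documents` argument to Doc2Vec, which will iterate it once for `build_vocab` and again for each training pass.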