NLTK POS tagging using my own tagged corpus?

Question

I'm attempting to write a basic POS tagger for the Dothraki language using NLTK. Similar to the Brown Corpus, I've got my own .txt file with words and their associated parts of speech. For example:

Anha/PRP vidrik/VBP khalasares/NN anni/NN jim/NN

What I'd like to do is load that corpus into NLTK and be able to see the parts of speech alongside the words, similar to how the Brown Corpus does it. So this is what I'm doing:

from nltk.corpus.reader import TaggedCorpusReader

corpus_root = '...'
dothraki_corpus_tagged = TaggedCorpusReader(corpus_root, ".*", ".txt")
print(dothraki_corpus_tagged.tagged_sents('dt01.txt'))

But my result is:

[[('Anha/PRP', None), ('vidrik/VBP', None), ('khalasares/NN', None), ('anni/NN', None), ('jim/NN', None)]]

Instead of

[[('Anha', 'PRP'), ('vidrik', 'VBP') ...]]

Answer 1

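So I feel kind of dumb right now, but I managed to get what I wanted by simply deleting the ".*" from the TaggedCorpusReader parameters. The constructor's positional parameters are (root, fileids, sep), so with ".*" in the second slot my ".txt" was being taken as the word/tag separator instead of the default "/", and the tokens never got split into word and tag. So what I've got now is: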

dothraki_corpus_tagged = TaggedCorpusReader(corpus_root, ".txt")
print(dothraki_corpus_tagged.tagged_sents('dothraki_01.txt'))
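
For anyone who wants to see the difference side by side, here is a minimal sketch that builds a one-line corpus in a temporary directory and reads it both ways. The temporary directory, the file contents, and the r".*\.txt" file pattern are my own choices for illustration and not part of the original setup. Passing sep as a keyword is just one way to keep the arguments from being mixed up.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical throwaway corpus directory, used only for this illustration.
corpus_root = tempfile.mkdtemp()
sample_path = os.path.join(corpus_root, "dothraki_01.txt")
with open(sample_path, "w", encoding="utf8") as f:
    f.write("Anha/PRP vidrik/VBP khalasares/NN anni/NN jim/NN\n")

# The call from the question: ".txt" lands in the `sep` slot, so no token
# contains the separator and every tag comes back as None.
broken = TaggedCorpusReader(corpus_root, ".*", ".txt")
broken_sents = [list(s) for s in broken.tagged_sents("dothraki_01.txt")]

# Spelling the arguments out: `fileids` is a regular expression over file
# names, and `sep` is the string between a word and its tag.
reader = TaggedCorpusReader(corpus_root, r".*\.txt", sep="/")
tagged_sents = [list(s) for s in reader.tagged_sents("dothraki_01.txt")]

print(broken_sents[0][0])
print(tagged_sents[0][0])
```

With the separator set correctly, each token is split at the last "/" into a (word, tag) pair, which is the Brown-style output the question was after.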
