
NLTK POS tagging using my own tagged corpus?

I'm attempting to write a basic POS tagger for the Dothraki language using NLTK. Similar to the Brown Corpus, I've got my own .txt file with words and their associated parts of speech. For example:

Anha/PRP vidrik/VBP khalasares/NN anni/NN jim/NN

What I'd like to do is load that corpus into NLTK and see the parts of speech alongside the words, similar to how the Brown Corpus does it. So this is what I'm doing:

from nltk.corpus.reader import TaggedCorpusReader

corpus_root = '...'
dothraki_corpus_tagged = TaggedCorpusReader(corpus_root, ".*", ".txt")
print(dothraki_corpus_tagged.tagged_sents('dt01.txt'))

But my result is:

[[('Anha/PRP', None), ('vidrik/VBP', None), ('khalasares/NN', None), ('anni/NN', None), ('jim/NN', None)]]

instead of what I expected:

[[('Anha', 'PRP'), ('vidrik', 'VBP') ...]]
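Tokens that keep their "/" and come back with a tag of None are what a separator-based split produces when the separator never occurs in the token. In `TaggedCorpusReader(root, fileids, sep='/', ...)` the third positional parameter is the word/tag separator, so the call above passes ".txt" as `sep`. The sketch below is a simplified pure-Python stand-in for that kind of split, not NLTK's own code; the helper name `split_tagged_token` is hypothetical.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' at the last occurrence of sep.

    Returns (token, None) when sep does not occur in the token.
    """
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag)


tokens = "Anha/PRP vidrik/VBP khalasares/NN anni/NN jim/NN".split()

# With "/" as the separator, every token splits into (word, tag).
with_slash = [split_tagged_token(t, sep="/") for t in tokens]

# With ".txt" as the separator, nothing matches, so every tag is None.
with_txt = [split_tagged_token(t, sep=".txt") for t in tokens]

print(with_slash)
print(with_txt)
```

The second list has the same shape as the unexpected result: whole tokens paired with None.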

So I feel kind of dumb right now, but I managed to get what I wanted by simply deleting the ".*" from the TaggedCorpusReader parameters. So what I've got now is:

dothraki_corpus_tagged = TaggedCorpusReader(corpus_root, ".txt")
print(dothraki_corpus_tagged.tagged_sents('dothraki_01.txt'))
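One way to make this call harder to get wrong is to pass the separator by keyword and give `fileids` an explicit pattern. The `fileids` string is treated as a regular expression, so in a bare ".txt" the dot matches any character; a pattern such as r".*\.txt" states the intent more clearly. What follows is a minimal sketch under assumptions: the temporary directory and the file written into it are throwaway stand-ins for the real corpus root, and the r".*\.txt" pattern is my suggestion rather than something from the post.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical throwaway corpus so the example runs anywhere;
# in practice corpus_root would point at the real corpus directory.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "dothraki_01.txt"), "w", encoding="utf8") as f:
    f.write("Anha/PRP vidrik/VBP khalasares/NN anni/NN jim/NN\n")

# fileids is a regular expression over file names;
# sep is the word/tag separator and is passed by keyword to avoid mix-ups.
reader = TaggedCorpusReader(corpus_root, r".*\.txt", sep="/")

print(reader.fileids())

# tagged_sents returns a lazy view; convert it to plain lists for printing.
tagged = [list(sent) for sent in reader.tagged_sents("dothraki_01.txt")]
print(tagged)
```

Because "/" is already the default for `sep`, the keyword could be omitted; spelling it out simply documents which argument is which.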
