
How does nltk.pos_tag() work?

How does nltk.pos_tag() work? Does it use any corpus? I found the source code (nltk.tag - NLTK 3.0 documentation), and it says

_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'.

Loading _POS_TAGGER gives an nltk.tag.sequential.ClassifierBasedPOSTagger object, which does not appear to have been trained on any corpus. The tagging is incorrect when I use a few adjectives in series before a noun (e.g. "the quick brown fox"). Can I improve the result by using a better tagging method, or by training on a better corpus? Any suggestions?

According to the source code, pos_tag uses NLTK's currently recommended POS tagger, which is PerceptronTagger as of 2018.

Here is the documentation for PerceptronTagger, and here is the source code.

To use the tagger, you can simply call pos_tag(tokens). This invokes PerceptronTagger's default constructor, which loads a pretrained model: a pickled file that NLTK distributes, located at taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle. It is trained and tested on the Wall Street Journal corpus.

Alternatively, you can instantiate a PerceptronTagger and train its model yourself by providing tagged example sentences, e.g.:

from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger(load=False)  # don't load the pretrained model
tagger.train([[('today', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), ('day', 'NN')],
              [('yes', 'NNS'), ('it', 'PRP'), ('beautiful', 'JJ')]])
print(tagger.tag(['today', 'is', 'beautiful']))

The documentation links to this blog post which does a good job of describing the theory.

TL;DR: PerceptronTagger is a greedy averaged perceptron tagger. It keeps a dictionary of weights associated with features and uses those weights to predict the most likely tag for a given set of features. During training it guesses a tag for each word and adjusts the weights according to whether the guess was correct; it is "greedy" because each tag is chosen immediately, without revisiting earlier decisions. "Averaged" means the final weights are the weight values averaged over all training iterations, which keeps the model from being dominated by late updates.
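The greedy averaged-perceptron idea can be sketched in a few lines of plain Python. This is a toy illustration with made-up feature strings, not NLTK's actual implementation (which uses a richer contextual feature set and several training passes):

```python
from collections import defaultdict

class ToyAveragedPerceptron:
    """Greedy averaged perceptron: score each tag by summing feature weights."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(float)   # (feature, tag) -> current weight
        self.totals = defaultdict(float)    # running sum of weights, for averaging
        self.steps = 0

    def predict(self, features):
        scores = {t: sum(self.weights[(f, t)] for f in features)
                  for t in self.tags}
        # Greedy choice: take the best-scoring tag now, with no lookahead.
        # Ties go to the tag listed first.
        return max(self.tags, key=scores.get)

    def train_step(self, features, truth):
        guess = self.predict(features)
        if guess != truth:
            # Reward the correct tag's features, penalize the wrong guess's.
            for f in features:
                self.weights[(f, truth)] += 1.0
                self.weights[(f, guess)] -= 1.0
        # Accumulate every weight so we can average over all steps later.
        self.steps += 1
        for key, w in self.weights.items():
            self.totals[key] += w

    def average(self):
        # "Averaged": final weights are the mean over all training steps,
        # damping weights that only helped briefly.
        for key in self.weights:
            self.weights[key] = self.totals[key] / self.steps

# Toy training data: features are invented strings, tags are Penn-style.
tagger = ToyAveragedPerceptron(['JJ', 'NN'])
data = [({'word=quick', 'next=brown'}, 'JJ'),
        ({'word=fox', 'prev=brown'}, 'NN')] * 5
for feats, tag in data:
    tagger.train_step(feats, tag)
tagger.average()
print(tagger.predict({'word=fox', 'prev=brown'}))  # → NN
```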

The tagger is a machine-learning tagger that has been trained and saved for you. No tagger is perfect, but if you want optimal performance you shouldn't try to roll your own. Look around for state-of-the-art taggers that are free to download and use, such as the Stanford tagger, for which NLTK provides an interface.

For the Stanford tagger in particular, see help(nltk.tag.stanford). You'll need to download the Stanford tools yourself from http://nlp.stanford.edu/software/ .

Yes, it involves a corpus called the Penn Treebank, which defines syntactic and semantic information as a set of linguistic parse trees.

