
How does nltk.pos_tag() work?

How does nltk.pos_tag() work? Does it involve the use of any corpus? I found the source code (nltk.tag - NLTK 3.0 documentation) and it says:

_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'.

Loading _POS_TAGGER gives an object:

nltk.tag.sequential.ClassifierBasedPOSTagger

, which seems to have had no training on a corpus. The tagging is incorrect when I use a few adjectives in series before a noun (e.g. "the quick brown fox"). I wonder if I can improve the result by using a better tagging method or by training on a better corpus. Any suggestions?

According to the source code, pos_tag uses NLTK's currently recommended POS tagger, which is PerceptronTagger as of 2018.

Here is the documentation for PerceptronTagger and here is the source code.

To use the tagger, simply call pos_tag(tokens). This calls PerceptronTagger's default constructor, which uses a "pretrained" model: a pickled model that NLTK distributes, located at taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle. It was trained and tested on the Wall Street Journal corpus.
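For reference, a minimal usage sketch (the nltk.download() calls and the example sentence are additions of mine; the downloads fetch the tokenizer and the pickled tagger model if they are not already installed):

import nltk

nltk.download('punkt')                       # tokenizer model
nltk.download('averaged_perceptron_tagger')  # the pretrained tagger discussed above

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# prints a list of (word, tag) pairs, e.g. [('The', 'DT'), ('quick', 'JJ'), ...]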

Alternatively, you can instantiate a PerceptronTagger and train its model yourself by providing tagged examples, e.g.:

from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger(load=False)  # don't load the existing pretrained model
tagger.train([[('today', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), ('day', 'NN')],
              [('yes', 'NNS'), ('it', 'PRP'), ('beautiful', 'JJ')]])
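Once trained, the tagger can be applied directly (a toy continuation of the example above; a model trained on two sentences will of course tag poorly):

print(tagger.tag(['today', 'is', 'beautiful']))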

The documentation links to this blog post, which does a good job of describing the theory.

TL;DR: PerceptronTagger is a greedy averaged perceptron tagger. This basically means that it has a dictionary of weights associated with features, which it uses to predict the correct tag for a given set of features. During training, the tagger guesses a tag and adjusts the weights according to whether or not the guess was correct. "Averaged" means the weight adjustments are averaged over the number of iterations.
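To make that concrete, here is a toy sketch of the perceptron update just described (my own simplification, not NLTK's actual code; feature extraction and the averaging step are omitted):

from collections import defaultdict

weights = defaultdict(float)  # maps (feature, tag) -> weight

def predict(features, tags):
    # score every candidate tag and greedily pick the highest-scoring one
    scores = {t: sum(weights[(f, t)] for f in features) for t in tags}
    return max(scores, key=scores.get)

def update(features, true_tag, guess):
    # on a wrong guess, reward the true tag's features and penalize the guess
    if guess != true_tag:
        for f in features:
            weights[(f, true_tag)] += 1.0
            weights[(f, guess)] -= 1.0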

The tagger is a machine-learning tagger that has been trained and saved for you. No tagger is perfect, but if you want optimal performance you shouldn't try to roll your own. Look around for state-of-the-art taggers that are free to download and use, such as the Stanford tagger, for which NLTK provides an interface.

For the Stanford tagger in particular, see help(nltk.tag.stanford). You'll need to download the Stanford tools yourself from http://nlp.stanford.edu/software/ .
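A rough sketch of hooking it up through NLTK (the file paths and model name below are placeholders; point them at wherever you unpack the Stanford tagger distribution, and make sure Java is on your PATH):

from nltk.tag import StanfordPOSTagger

st = StanfordPOSTagger(
    model_filename='/path/to/models/english-bidirectional-distsim.tagger',
    path_to_jar='/path/to/stanford-postagger.jar')
print(st.tag('The quick brown fox jumps over the lazy dog'.split()))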

Yes, it involves a corpus called the Penn Treebank, which defines syntactic and semantic information as a set of linguistic trees.
