Is there a way to correctly tag (PoS tagging) the words that together form a phrase?

Posted 2019-11-04 17:30:47 | Tags: python, nlp, nltk, pos-tagger

I tried various ways to correctly tag a group of words that form a phrase (especially a noun phrase), but could not succeed.

Example: 'the', 'first', 'early', 'morning', 'sunbeams'

'early' and 'morning' are wrongly tagged as 'Noun'. The expected outcome is: ('first', 'adverb'), ('early', 'adverb'), ('morning', 'adjective'), ('sunbeams', 'noun')

Could you please suggest a procedure to tag these words correctly?

Thanks in advance.

1 Answer

POS taggers normally use Hidden Markov Models. If your data is not tagged correctly with these methods, then either your tagger (self-made?) is not suited to your input data, or your training data is inadequate (too small, wrong annotations, etc.). I assume the "various means" you tried were taggers from NLTK or spaCy, or the Stanford tools ( https://nlp.stanford.edu/software/ ). These packages perform at the level of current research, so if the output is still error-prone, you won't be able to fix it. If you have a large cluster at hand, you could build your own tagger using n-grams with n > 3, if you like, but I doubt it would be any better than the packages named above.
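To make the point about training data and context concrete, here is a minimal sketch of a bigram tagger with backoff, written in plain Python so it needs no model downloads. It is not how NLTK, spaCy, or the Stanford tools are implemented. The corpus `TRAIN`, the tag names (`DET`, `ADJ`, `NOUN`, `VERB`), and the functions `train` and `tag` are all hypothetical, and the annotations are my own toy choices rather than a standard tag set.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus (an assumption for illustration only;
# a usable tagger needs a large annotated corpus).
TRAIN = [
    [("the", "DET"), ("morning", "NOUN"), ("was", "VERB"), ("cold", "ADJ")],
    [("the", "DET"), ("early", "ADJ"), ("morning", "ADJ"), ("light", "NOUN")],
    [("the", "DET"), ("first", "ADJ"), ("early", "ADJ"), ("train", "NOUN")],
    [("an", "DET"), ("early", "ADJ"), ("morning", "ADJ"), ("walk", "NOUN")],
    [("every", "DET"), ("morning", "NOUN"), ("was", "VERB"), ("cold", "ADJ")],
]


def train(sentences):
    """Count tags per (previous tag, word) and per word alone."""
    bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
    unigram = defaultdict(Counter)  # word -> tag counts
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag_ in sentence:
            bigram[(prev, word)][tag_] += 1
            unigram[word][tag_] += 1
            prev = tag_
    return bigram, unigram


def tag(words, bigram, unigram, default="NOUN"):
    """Tag left to right: bigram context first, then word only, then default."""
    tagged, prev = [], "<s>"
    for word in words:
        key = word.lower()
        if (prev, key) in bigram:
            best = bigram[(prev, key)].most_common(1)[0][0]
        elif key in unigram:
            best = unigram[key].most_common(1)[0][0]
        else:
            best = default  # unseen word: fall back to the default tag
        tagged.append((word, best))
        prev = best
    return tagged


bigram, unigram = train(TRAIN)
print(tag(["the", "first", "early", "morning", "sunbeams"], bigram, unigram))
print(tag(["the", "morning", "was", "cold"], bigram, unigram))
```

With this toy data, 'morning' comes out as `ADJ` after another adjective and as `NOUN` directly after a determiner, while the unseen word 'sunbeams' falls back to the default tag. The result depends entirely on how the training sentences were annotated, which is why an inadequate or mismatched training corpus shows up directly as tagging errors.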