
Python NLTK POS tagger not behaving as expected

I ran the pos_tag function on the text below, and the output tags 'battery' as 'RB'. Since 'battery' is a noun here, it should be tagged 'NN'.

import nltk
nltk.pos_tag(nltk.word_tokenize('Camera picture quality was fair but speed was an issue and also battery life was not that good'))

Output:

[('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'), ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'), ('an', 'DT'), ('issue', 'NN'), ('and', 'CC'), ('also', 'RB'), ('battery', 'RB'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'), ('that', 'IN'), ('good', 'JJ')]

When I tag the same sentence with this tagger, http://cst.dk/online/pos_tagger/uk/ , it tags 'battery' as 'NN' and gives the following output:

Camera/NNP picture/NN quality/NN was/VBD fair/JJ but/CC speed/NN was/VBD an/DT issue/NN and/CC also/RB battery/NN life/NN was/VBD not/RB that/IN good/JJ

Edit:

With the sentence changed to:

"Camera picture quality was fair but speed was an issue but battery life was not that good"

the NLTK tagger gives the following output:

[('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'), ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'), ('an', 'DT'), ('issue', 'NN'), ('but', 'CC'), ('battery', 'NN'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'), ('that', 'IN'), ('good', 'JJ')]

Please explain!

It seems the only difference is that cst.dk tagged 'battery' as NN, while NLTK tagged it as RB (adverb):

>>> from nltk import pos_tag
>>> cstdk_output = "Camera/NNP picture/NN quality/NN was/VBD fair/JJ but/CC speed/NN was/VBD an/DT issue/NN and/CC also/RB battery/NN life/NN was/VBD not/RB that/IN good/JJ"
>>> # Split each "word/TAG" token into a (word, tag) tuple.
>>> cstdk_postags = [tuple(tok.split('/')) for tok in cstdk_output.split()]
>>> # Re-tag the same tokens with NLTK.
>>> sent = [word for word, tag in cstdk_postags]
>>> nltk_postags = pos_tag(sent)
>>> # Keep only the tokens where the taggers disagree: (word, cst.dk tag, NLTK tag).
>>> diff = [(c[0], c[1], n[1]) for c, n in zip(cstdk_postags, nltk_postags) if c[1] != n[1]]
>>> diff
[('battery', 'NN', 'RB')]

There is not much to explain. NLTK's tagger is a statistically trained system that uses Maximum Entropy (see _POS_TAGGER in http://www.nltk.org/_modules/nltk/tag.html#pos_tag ), so it is bound to make mistakes. For other mistakes it makes, see "POS tagging - NLTK thinks noun is adjective".
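If a known mis-tag like this matters for your application, one possible workaround is to post-correct the tagger's output with a small hand-maintained lexicon. The following is only a minimal sketch under stated assumptions: `override_tags` and `DOMAIN_TAGS` are hypothetical names invented for illustration (they are not part of NLTK), and the input is the NLTK output copied from the question, so the snippet runs without NLTK installed.

```python
# NLTK's output for the original sentence, copied from the question.
nltk_output = [
    ('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'),
    ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'),
    ('an', 'DT'), ('issue', 'NN'), ('and', 'CC'), ('also', 'RB'),
    ('battery', 'RB'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'),
    ('that', 'IN'), ('good', 'JJ'),
]

# Hypothetical domain lexicon: words whose tag we trust more than the model's.
DOMAIN_TAGS = {'battery': 'NN'}


def override_tags(tagged, lexicon):
    """Return a copy of `tagged` where words found in `lexicon` get its tag.

    Lookup is case-insensitive; all other (word, tag) pairs are left unchanged.
    """
    return [(word, lexicon.get(word.lower(), tag)) for word, tag in tagged]


corrected = override_tags(nltk_output, DOMAIN_TAGS)

# List the tokens whose tag was changed: (word, old tag, new tag).
changed = [
    (word, old, new)
    for (word, old), (_, new) in zip(nltk_output, corrected)
    if old != new
]
print(changed)  # [('battery', 'RB', 'NN')]
```

This forces a fixed tag regardless of context, so it is only reasonable for words that are unambiguous in your domain; it does not fix the underlying model.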
