Why is pos_tag in NLTK tagging “please” as NN?
I have a problem: I have downloaded the latest version of NLTK and I am getting strange POS output:
import nltk

sample_text = "start please with me"
tokenized = nltk.sent_tokenize(sample_text)
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""Chank___Start:{<VB|VBZ>*} """
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)
[out]:
(S start/JJ please/NN with/IN me/PRP)
I do not know why "start" is tagged as JJ and "please" as NN.
The default NLTK pos_tag has somehow learnt that please is a noun. And that's not correct in almost any case in proper English, e.g.
>>> from nltk import pos_tag
>>> pos_tag('Please go away !'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please'.split())
[('Please', 'VB')]
>>> pos_tag('please'.split())
[('please', 'NN')]
>>> pos_tag('please !'.split())
[('please', 'NN'), ('!', '.')]
>>> pos_tag('Please !'.split())
[('Please', 'NN'), ('!', '.')]
>>> pos_tag('Would you please go away ?'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
>>> pos_tag('Would you please go away !'.split())
[('Would', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('go', 'VB'), ('away', 'RB'), ('!', '.')]
>>> pos_tag('Please go away ?'.split())
[('Please', 'NNP'), ('go', 'VB'), ('away', 'RB'), ('?', '.')]
Using WordNet as a benchmark, there shouldn't be a case where please is a noun:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('please')
[Synset('please.v.01'), Synset('please.v.02'), Synset('please.v.03'), Synset('please.r.01')]
But I think this is largely due to the text that was used to train the PerceptronTagger, rather than the implementation of the tagger itself.
Now, if we take a look at what's inside the pre-trained PerceptronTagger, we see that its lookup dictionary only knows 1500+ words:
>>> from nltk import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> tagger.tagdict['I']
'PRP'
>>> tagger.tagdict['You']
'PRP'
>>> tagger.tagdict['start']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'start'
>>> tagger.tagdict['Start']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'Start'
>>> tagger.tagdict['please']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'please'
>>> tagger.tagdict['Please']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'Please'
>>> len(tagger.tagdict)
1549
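The reason that dictionary is so small is that, at training time, a word only enters tagdict when it is both frequent and almost unambiguous in the training corpus; everything else falls through to the perceptron model. Here is a rough, self-contained sketch of that filtering step (the thresholds loosely follow NLTK's implementation, which uses a minimum frequency of 20 and a 97% single-tag cutoff; the toy corpus is made up for illustration):

```python
from collections import Counter, defaultdict

def make_tagdict(tagged_sents, freq_thresh=20, ambiguity_thresh=0.97):
    """Keep only words that are frequent AND almost always get the same tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    tagdict = {}
    for word, tag_freqs in counts.items():
        tag, mode = tag_freqs.most_common(1)[0]
        n = sum(tag_freqs.values())
        if n >= freq_thresh and mode / n >= ambiguity_thresh:
            tagdict[word] = tag
    return tagdict

# Toy corpus: 'the' and 'dog' are frequent, 'please' is rare.
corpus = [[('the', 'DT'), ('dog', 'NN')]] * 25 + [[('please', 'VB')]] * 3
td = make_tagdict(corpus)
print(td)  # 'please' is too rare, so it never enters the lookup table
```

So a word like please that the training text rarely (or inconsistently) covers never gets a direct dictionary entry, and its tag is left entirely to the statistical model.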
One trick you can do is to hack the tagger's lookup dictionary (note that tagdict lookups are case-sensitive, so a capitalized form such as 'Please' would need its own entry):
>>> tagger.tagdict['start'] = 'VB'
>>> tagger.tagdict['please'] = 'VB'
>>> tagger.tag('please start with me'.split())
[('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')]
But the most logical thing to do is simply to retrain the tagger, see http://www.nltk.org/_modules/nltk/tag/perceptron.html#PerceptronTagger.train
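As a minimal sketch of that retraining API (the tiny hand-made training set here is purely illustrative; a real retrain should use a large tagged corpus such as a treebank):

```python
from nltk.tag import PerceptronTagger

# Start from an empty model instead of the shipped pickle.
tagger = PerceptronTagger(load=False)

# Made-up training sentences, for illustration only.
train_sents = [
    [('please', 'VB'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')],
    [('please', 'VB'), ('go', 'VB'), ('away', 'RB')],
    [('you', 'PRP'), ('start', 'VB'), ('with', 'IN'), ('me', 'PRP')],
]
tagger.train(train_sents, nr_iter=5)

print(tagger.tag('please start with me'.split()))
```

With a real corpus behind it, the retrained model would then tag please as VB wherever the training data supports it.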
And if you don't want to retrain a tagger, then see Python NLTK pos_tag not returning the correct part-of-speech tag
Most probably, using the StanfordPOSTagger gets you what you need:
>>> from nltk import StanfordPOSTagger
>>> sjar = '/home/alvas/stanford-postagger/stanford-postagger.jar'
>>> m = '/home/alvas/stanford-postagger/models/english-left3words-distsim.tagger'
>>> spos_tag = StanfordPOSTagger(m, sjar)
>>> spos_tag.tag('Please go away !'.split())
[(u'Please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Please'.split())
[(u'Please', u'VB')]
>>> spos_tag.tag('Please !'.split())
[(u'Please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please !'.split())
[(u'please', u'VB'), (u'!', u'.')]
>>> spos_tag.tag('please'.split())
[(u'please', u'VB')]
>>> spos_tag.tag('Would you please go away !'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'!', u'.')]
>>> spos_tag.tag('Would you please go away ?'.split())
[(u'Would', u'MD'), (u'you', u'PRP'), (u'please', u'VB'), (u'go', u'VB'), (u'away', u'RB'), (u'?', u'.')]
For Linux: see https://gist.github.com/alvations/e1df0ba227e542955a8a
For Windows: see https://gist.github.com/alvations/0ed8641d7d2e1941b9f9