NLTK Brill Tagger Splitting Words
I am using Python 3.4.1 and NLTK 3, and I am trying to use NLTK's Brill tagger.
Here is the training code for the Brill tagger:
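import nltk
from nltk.tag.brill import *
import nltk.tag.brill_trainer as bt
from nltk.corpus import brown

# Reset any previously registered templates, then load the fntbl37 template set
Template._cleartemplates()
templates = fntbl37()

# Training data: tagged sentences from the 'news' category of the Brown corpus
tagged_sentences = brown.tagged_sents(categories = 'news')
tagged_sentences = tagged_sentences[:]

# A bigram tagger serves as the initial tagger that the Brill rules refine
tagger = nltk.tag.BigramTagger(tagged_sentences)
tagger = bt.BrillTaggerTrainer(tagger, templates, trace=3)
tagger = tagger.train(tagged_sentences, max_rules=250)

# Evaluate on the 'fiction' category, then tag a sample sentence
print(tagger.evaluate(brown.tagged_sents(categories='fiction')[:]))
print(tagger.tag("Hi I am Harry Potter."))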
However, the output of the last command is:
[('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN'), (' ', 'NN'), ('a', 'AT'), ('m', 'NN'), (' ', 'NN'), ('H', 'NN'), ('a', 'AT'), ('r', 'NN'), ('r', 'NN'), ('y', 'NN'), (' ', 'NN'), ('P', 'NN'), ('o', 'NN'), ('t', 'NN'), ('t', 'NN'), ('e', 'NN'), ('r', 'NN'), ('.', '.')]
How do I stop it from splitting the words into letters and tagging the letters instead of the words?
The tag() function expects a list of tokens as input. Since you give it a string, the string itself is treated as the sequence of tokens, and iterating over a string yields its individual characters. Turning a string into a list shows the same effect:
>>> list("abc")
['a', 'b', 'c']
All you need to do is turn your string into a list of tokens before tagging, for example with NLTK's word_tokenize() or simply by splitting on whitespace:
>>> import nltk
>>> nltk.word_tokenize("Hi I am Harry Potter.")
['Hi', 'I', 'am', 'Harry', 'Potter', '.']
>>> "Hi I am Harry Potter.".split(' ')
['Hi', 'I', 'am', 'Harry', 'Potter.']
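Note that the two approaches above differ in how they treat the final period: splitting on whitespace leaves it attached to 'Potter.', while word_tokenize() returns it as its own token. If you want to see that difference without NLTK's tokenizer data installed, here is a rough sketch using only the standard library. The function name simple_tokenize and the regular expression are my own illustration, not part of NLTK, and the pattern is far cruder than word_tokenize() (it will split contractions and abbreviations badly):

```python
import re

def simple_tokenize(text):
    # Hypothetical helper, not an NLTK function: a run of word characters
    # becomes one token, and each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Hi I am Harry Potter."
whitespace_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)  # the period stays glued to the last word
print(regex_tokens)       # the period is a separate token
```

Since the period appears as a separate token in the question's own output, keeping it separate at tagging time should usually match the training data better than 'Potter.'.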
Tokenizing the sentence before tagging gives the following result:
print(tagger.tag(nltk.word_tokenize("Hi I am Harry Potter.")))
[('Hi', 'NN'), ('I', 'PPSS'), ('am', 'VB'), ('Harry', 'NN'), ('Potter', 'NN'), ('.', '.')]
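If you want to guard against accidentally passing a raw string again, one option is a small wrapper that tokenizes strings before handing them to the tagger. This is only a sketch: tag_text and EchoTagger are names I made up for illustration, and EchoTagger is a stand-in that labels everything 'NN' so the example runs without a trained model. The wrapper only assumes the tagger object has a tag(tokens) method, as the tagger in the question does.

```python
def tag_text(tagger, text_or_tokens, tokenize=str.split):
    """Tag a raw string or a token list with any object exposing .tag(tokens)."""
    if isinstance(text_or_tokens, str):
        # A bare string would be iterated character by character,
        # so turn it into a list of tokens first.
        tokens = tokenize(text_or_tokens)
    else:
        tokens = list(text_or_tokens)
    return tagger.tag(tokens)


class EchoTagger:
    """Hypothetical stand-in for a trained tagger; labels every token 'NN'."""

    def tag(self, tokens):
        return [(token, "NN") for token in tokens]


result = tag_text(EchoTagger(), "Hi I am Harry Potter.")
print(result)
```

With the real tagger you would pass your trained tagger as the first argument and tokenize=nltk.word_tokenize, so punctuation is split off the same way as in the example above. The default of str.split is only there to keep the sketch free of dependencies.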