
NLTK Brill Tagger Splitting Words

I am using Python 3.4.1 and NLTK 3, and I am trying to use its Brill tagger.

Here is the training code for the Brill tagger:

import nltk
from nltk.tag.brill import *
import nltk.tag.brill_trainer as bt
from nltk.corpus import brown

Template._cleartemplates()
templates = fntbl37()
tagged_sentences = brown.tagged_sents(categories = 'news')
tagged_sentences = tagged_sentences[:]
tagger = nltk.tag.BigramTagger(tagged_sentences)
tagger = bt.BrillTaggerTrainer(tagger, templates, trace=3)
tagger = tagger.train(tagged_sentences, max_rules=250)
print(tagger.evaluate(brown.tagged_sents(categories='fiction')[:]))
print(tagger.tag("Hi I am Harry Potter."))

However, the output of the last command is:

[('H', 'NN'), ('i', 'NN'), (' ', 'NN'), ('I', 'NN'), (' ', 'NN'), ('a', 'AT'), ('m', 'NN'), (' ', 'NN'), ('H', 'NN'), ('a', 'AT'), ('r', 'NN'), ('r', 'NN'), ('y', 'NN'), (' ', 'NN'), ('P', 'NN'), ('o', 'NN'), ('t', 'NN'), ('t', 'NN'), ('e', 'NN'), ('r', 'NN'), ('.', '.')]

How do I stop it from splitting the words into letters and tagging the letters instead of the words?

The tag() function expects a list of tokens as input. Since you give it a string, the string is iterated as a sequence of characters, so each character is tagged as if it were a token. Turning a string into a list shows the same behavior:

>>> list("abc")
['a', 'b', 'c']

All you need to do is turn your string into a list of tokens before tagging, for example with nltk.word_tokenize or simply by splitting on whitespace:

>>> import nltk
>>> nltk.word_tokenize("Hi I am Harry Potter.")
['Hi', 'I', 'am', 'Harry', 'Potter', '.']
>>> "Hi I am Harry Potter.".split(' ')
['Hi', 'I', 'am', 'Harry', 'Potter.']
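Note that plain splitting leaves the period attached to 'Potter.', so the tagger sees a token it probably never saw in training. If the NLTK tokenizer data is not installed, a rough regex stand-in can at least separate punctuation. This is only a sketch: the name simple_tokenize is made up for illustration, and it does not reproduce how word_tokenize handles contractions or abbreviations.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A rough approximation only; it is not equivalent to nltk.word_tokenize.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hi I am Harry Potter."))
```

For anything beyond a quick experiment, nltk.word_tokenize is the better choice, since the tokens should match the tokenization of the training corpus as closely as possible.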

Adding tokenization to the tagging call gives the following result:

print(tagger.tag(nltk.word_tokenize("Hi I am Harry Potter.")))
[('Hi', 'NN'), ('I', 'PPSS'), ('am', 'VB'), ('Harry', 'NN'), ('Potter', 'NN'), ('.', '.')]
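To avoid repeating this mistake, one option is a small wrapper that tokenizes a raw string before handing it to the tagger. The sketch below is an assumption, not part of NLTK: tag_text and StubTagger are hypothetical names, and StubTagger only imitates the tag(tokens) interface so that the example runs without a trained model.

```python
class StubTagger:
    """Hypothetical stand-in exposing the same tag(tokens) method as an NLTK tagger."""

    def tag(self, tokens):
        # Label every token 'NN'; a real tagger would predict a tag per token.
        return [(token, "NN") for token in tokens]


def tag_text(tagger, text_or_tokens, tokenize=str.split):
    """Tag either a raw string or an already tokenized sequence.

    A raw string is tokenized first, so it is never iterated character by character.
    """
    if isinstance(text_or_tokens, str):
        tokens = tokenize(text_or_tokens)
    else:
        tokens = list(text_or_tokens)
    return tagger.tag(tokens)


# Both calls hand the tagger the same three tokens.
print(tag_text(StubTagger(), "Hi I am"))
print(tag_text(StubTagger(), ["Hi", "I", "am"]))
```

With the trained tagger from the question, the call would look like tag_text(tagger, "Hi I am Harry Potter.", tokenize=nltk.word_tokenize); the default of str.split is used here only to keep the sketch free of dependencies.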
