
Working with the Python NLTK: How can I improve the accuracy of the POS tagger?

I've been using NLTK's POS tagger:

...
nltk.pos_tag(nltk.word_tokenize(tfile[i]))
...

but sometimes I get inaccurate results (NN when I should be getting JJ, and so forth). The text I want to tag is within a fairly specific business domain (I'm not quite at liberty to say which domain here). Admittedly, I'm not an expert with either Python or the NLTK (working on it, however), but I was wondering if there were some way to improve the accuracy of the tagger.

I think I understand that the tagger works by comparing the text given to it to a corpus of pretagged text. My natural inclination is to try to add a set of my own self-tagged sentences to this corpus... but I don't know how to do this.

I'd greatly appreciate any advice on how to add my own text to the corpus (I'd prefer to add to an existing one rather than start a new one entirely). And if anyone has other suggestions for improving the tagger's accuracy for my purposes, I'd love to hear them.

Thank you!

You have probably already seen the GoogleCode book on NLTK. I've been working through it very slowly on my own, and while I have yet to tackle POS tagging, it's one of the things I ultimately want to do when I feel adept enough to use the tool. At any rate, in Chapter 5, section 2, you get the following text and examples on making your own set of tagged tokens (apologies to all, but I copied directly from the text):

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

Continued from 5.2:

We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

That "sent" variable up above is actually what raw tagged text looks like, as confirmed by going to the nltk_data directory on my own computer and looking at anything in corpora/brown/, so you could write your own tagged text using this formatting and then build your own set of tagged tokens with it.

Once you have set up your own tagged tokens, you should then be able to set up your own unigram tagger based on them (from 5.5):

>>> unigram_tagger = nltk.UnigramTagger(YOUR_OWN_TAGGED_TOKENS)
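One wrinkle worth flagging: as far as I can tell, nltk.UnigramTagger wants a list of tagged sentences (each sentence a list of (word, tag) tuples), not one flat list of tokens. A tiny self-contained sketch:

import nltk

# A toy hand-tagged training set: a list of tagged sentences,
# each sentence a list of (word, tag) tuples.
my_tagged_sents = [
    [('The', 'AT'), ('price', 'NN'), ('rose', 'VBD'), ('.', '.')],
    [('A', 'AT'), ('sharp', 'JJ'), ('rise', 'NN'), ('.', '.')],
]

unigram_tagger = nltk.UnigramTagger(my_tagged_sents)

# Words never seen in training come back tagged None.
print(unigram_tagger.tag(['The', 'rise', 'was', 'sharp', '.']))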

Finally, because your tagged text is likely to be a really small sample (and thus inaccurate), you can list a fallback tagger, so that when it fails, the fallback comes to the rescue:

>>> t0 = nltk.UnigramTagger(a_bigger_set_of_tagged_tokens)
>>> t1 = nltk.UnigramTagger(your_own_tagged_tokens, backoff=t0)
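To make that concrete, here's a sketch where the "bigger set" is the Brown corpus news category (my choice; any tagged corpus should work):

import nltk
from nltk.corpus import brown  # nltk.download('brown') if you don't have it

# The big general-purpose set becomes the fallback...
t0 = nltk.UnigramTagger(brown.tagged_sents(categories='news'))

# ...and your small domain-specific set is layered on top of it.
my_tagged_sents = [
    [('The', 'AT'), ('widget', 'NN'), ('shipped', 'VBD'), ('.', '.')],
]
t1 = nltk.UnigramTagger(my_tagged_sents, backoff=t0)

print(t1.tag('The widget shipped late .'.split()))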

Lastly, you should look into the differences between the n-gram taggers (unigram, bigram, etc.), also covered in the aforementioned Chapter 5.
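The usual Chapter 5 pattern chains them in order of specificity, something like this sketch:

import nltk
from nltk.corpus import brown  # nltk.download('brown') if missing

train_sents = brown.tagged_sents(categories='news')

t0 = nltk.DefaultTagger('NN')                     # last resort: guess noun
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # most likely tag per word
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # conditions on previous tag

print(t2.tag('The jury commented on the report .'.split()))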

At any rate, if you continue reading through Chapter 5, you'll see a few different ways of tagging text (including my favorite: the regex tagger!). There are a lot of ways to do this, and it's much too complex to cover adequately in a small post like this.
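For the curious, the regex tagger is just an ordered list of (pattern, tag) pairs tried top to bottom; the patterns below are roughly the ones from Chapter 5:

import nltk

# Ordered (pattern, tag) pairs; the last pattern always matches.
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd person singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN'),                    # everything else: call it a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag('The cats walked slowly'.split()))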

Caveat emptor: I haven't tried all of this code, so I offer it as a solution I am currently, myself, trying to work out. If I have made errors, please help me correct them.

"How do I improve the NLTK tagger" is a popular question :-) I definitely wouldn't advise you to hand-make a corpus to train a tagger with. Taggers need a huge amount of data to work properly on new text.

What you could do, if you want to make the effort, is "bootstrap" a corpus: tag a bunch of text in your domain with the NLTK tagger, hand-correct the mistakes in a subset (easier if they're predictable), and use the result to train a better tagger. You could even repeat the process so you can hand-clean a larger amount of your own material. Your new tagger will still be based on a relatively small amount of text, so you can add the default tagger as a fallback, as @erewok shows. See also this question, which asks the same thing.
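The round trip might look roughly like this (a sketch only; the file names and the idea of writing out word/TAG lines for hand-correction are my own choices, not anything NLTK prescribes):

import nltk  # needs punkt and the default tagger model downloaded

# 1. Auto-tag raw domain text and write it out as word/TAG lines,
#    so the mistakes can be hand-corrected in any text editor.
with open('domain_raw.txt') as f_in, open('to_correct.txt', 'w') as f_out:
    for line in f_in:
        tagged = nltk.pos_tag(nltk.word_tokenize(line))
        f_out.write(' '.join('%s/%s' % (w, t) for w, t in tagged) + '\n')

# 2. Once to_correct.txt has been hand-corrected, read it back in
#    and train on it; repeat with more text as the tagger improves.
with open('to_correct.txt') as f:
    train_sents = [[nltk.tag.str2tuple(t) for t in line.split()]
                   for line in f if line.strip()]
better_tagger = nltk.UnigramTagger(train_sents,
                                   backoff=nltk.DefaultTagger('NN'))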

As @erewok has pointed out, using a backoff tagger is a good way of improving things. Start with the most accurate. If it can't tag, or the calculated probability is below a set threshold, then try the next (less accurate) method. Even a final "assume it is a noun" step can make a measurable improvement.
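As a sketch of how you'd measure that kind of improvement: train on one slice of a tagged corpus, hold out the rest as a test set, and compare the chain against the noun-only baseline:

import nltk
from nltk.corpus import brown  # nltk.download('brown') if missing

sents = brown.tagged_sents(categories='news')
train_sents, test_sents = sents[:4000], sents[4000:]

noun_only = nltk.DefaultTagger('NN')
chain = nltk.BigramTagger(train_sents,
                          backoff=nltk.UnigramTagger(train_sents,
                                                     backoff=noun_only))

# evaluate() scores against the gold tags (renamed accuracy() in newer NLTK).
print(chain.evaluate(test_sents))      # the full backoff chain
print(noun_only.evaluate(test_sents))  # the "assume it is a noun" baseline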

Things like unigram and bigram taggers are generally not that accurate. I would recommend starting with a Naive Bayes tagger first (these are covered in the O'Reilly book). It could use a unigram tagger or a WordNet tagger (which looks the word up in WordNet and uses the most frequent example) as a backoff tagger.
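NLTK's ClassifierBasedPOSTagger uses a Naive Bayes classifier by default, and it accepts a backoff tagger like the sequential taggers do. A sketch (I train on a small slice here because classifier training is slow):

import nltk
from nltk.corpus import brown
from nltk.tag.sequential import ClassifierBasedPOSTagger

train_sents = brown.tagged_sents(categories='news')[:1000]

# Naive Bayes is the default classifier_builder; the backoff takes over
# when the classifier's confidence drops below cutoff_prob.
nb_tagger = ClassifierBasedPOSTagger(train=train_sents,
                                     backoff=nltk.DefaultTagger('NN'),
                                     cutoff_prob=0.3)
print(nb_tagger.tag('The jury commented on the report .'.split()))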

Then you could move to a MaxEnt (Maximum Entropy) tagger, which is considered more accurate than Naive Bayes due to its support for dependent features. However, it is slower and requires more effort to implement, and the end result might not be worth it. The NLTK version can also be a bit difficult to use.
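If you do want to try it, the MaxEnt variant just swaps in a different classifier builder; a sketch (max_iter=10 is my arbitrary cap to keep training time tolerable):

import nltk
from nltk.corpus import brown
from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

train_sents = brown.tagged_sents(categories='news')[:500]

def maxent_builder(train_feats):
    # trace=0 silences per-iteration logging; max_iter caps training time.
    return MaxentClassifier.train(train_feats, trace=0, max_iter=10)

me_tagger = ClassifierBasedPOSTagger(train=train_sents,
                                     classifier_builder=maxent_builder)
print(me_tagger.tag('The jury commented on the report .'.split()))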

To train these taggers, NLTK does come with various corpora. Not knowing anything about your domain, I don't know how useful they will be, but they include a subset of the Penn Treebank, some domain-specific corpora, various languages, etc. Have a look.
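Each of those ships with a tagged-corpus reader, so swapping training data is cheap to experiment with; a sketch of peeking at a few of them:

import nltk
from nltk.corpus import brown, treebank, conll2000
# each needs a one-time nltk.download() of the matching corpus name

print(brown.tagged_sents()[0])      # Brown corpus, original tagset
print(treebank.tagged_sents()[0])   # the Penn Treebank sample
print(conll2000.tagged_sents()[0])  # CoNLL-2000 chunking data

# Many readers can also map their tags to the simplified universal
# tagset (needs the universal_tagset resource downloaded).
print(brown.tagged_sents(tagset='universal')[0])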

As well as the O'Reilly book, I would recommend Jacob Perkins' Python Text Processing with NLTK Cookbook, which includes practical examples of this type of thing.
