POS tagging - NLTK - Python

I want to use word_tokenize, pos_tag, and FreqDist, but I don't want to download everything NLTK offers by default. Instead I want to call nltk.download(info_or_id=''). Which values should I pass as info_or_id to get POS tagging (with Penn Treebank POS tags) and the tag frequencies?

If you look at the list of corpora at http://www.nltk.org/nltk_data/, each description includes its id, e.g. brown, wordnet, book_grammars. Which ones you choose is up to you and depends on your application. Look for a tagged corpus: Brown, for example, includes POS tags, but I guess you'll have to look at each one to see. The treebank entries mention Penn Treebank (id treebank) and Sinica Treebank (id sinica_treebank). See also the heading "Parsed Corpora" at http://www.nltk.org/howto/corpus.html

Your question confuses the nltk itself with nltk_data. You can't really download just part of the nltk (though you could manually trim it down, carefully, if you need to save space). But I think you're trying to avoid downloading all of the nltk data. As @barny wrote, you can see the IDs of the different resources when you open the interactive nltk.download() window.

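  1. To use the treebank POS tagger, you need its pickled training tables (not the treebank corpus); you'll find them in the "Models" tab under the ID maxent_treebank_pos_tagger. Hence: nltk.download("maxent_treebank_pos_tagger"). Note that newer NLTK releases use a different default tagger for pos_tag, listed under the ID averaged_perceptron_tagger; if pos_tag raises a LookupError, the message names the resource you still need to download.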

  2. The FreqDist class doesn't have or need any trained model.

  3. word_tokenize, which takes a sentence as a single string and breaks it up into words, has no trained model of its own either. However, you'll need the model for sent_tokenize, which breaks up a longer text into sentences; recent NLTK versions of word_tokenize call it internally, so plan on downloading it either way. That's handled by the "Punkt" sentence tokenizer, and you can download its model with nltk.download("punkt").
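Putting the three points together, here is a minimal sketch of the whole pipeline. The function names (download_resources, tag_frequencies, pos_tag_frequencies) and the RESOURCE_IDS list are made up for this illustration; the IDs are the ones named above, and your NLTK version may ask for different ones. FreqDist is a subclass of collections.Counter, so the sketch falls back to Counter when nltk is not importable, which lets you try the counting step on hand-tagged pairs before installing anything.

```python
try:
    import nltk
    from nltk import FreqDist
except ImportError:
    # Assumption for this sketch only: FreqDist subclasses Counter, so Counter
    # stands in for it when NLTK itself is not installed.
    nltk = None
    from collections import Counter as FreqDist

# Resource IDs taken from the answer above. Newer NLTK releases may want other
# IDs; the LookupError that NLTK raises names the missing resource.
RESOURCE_IDS = ["punkt", "maxent_treebank_pos_tagger"]


def download_resources(ids=RESOURCE_IDS):
    """Fetch only the named nltk_data packages instead of the whole collection."""
    for resource_id in ids:
        nltk.download(resource_id, quiet=True)


def tag_frequencies(tagged_tokens):
    """Count the POS tags in a list of (word, tag) pairs."""
    return FreqDist(tag for _word, tag in tagged_tokens)


def pos_tag_frequencies(text):
    """Tokenize text, tag it with Penn Treebank tags, and count the tags.

    Needs NLTK plus the downloaded resources (see download_resources).
    """
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    return tag_frequencies(tagged)
```

With NLTK installed, you would call download_resources() once and then pos_tag_frequencies("some text"); the result supports the usual FreqDist lookups such as most_common(5).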

PS. For general-purpose use, I recommend downloading everything in the "book" collection, i.e. nltk.download("book"). It's only a fraction of the total, and it lets you do most things without scrambling every so often to figure out what's missing.
