Getting a Large List of Nouns (or Adjectives) in Python with NLTK; or Python Mad Libs

Like this question, I am interested in getting a large list of words by part of speech (a long list of nouns; a list of adjectives) to be used programmatically elsewhere. This answer has a solution using the WordNet database in SQL format.

Is there a way to get at such a list using the corpora/tools built into the Python NLTK? I could take a large bunch of text, parse it, and then store the nouns and adjectives. But given the dictionaries and other tools built in, is there a smarter way to simply extract the words that are already present in the NLTK datasets, encoded as nouns/adjectives (whatever)?

Thanks.

It's worth noting that WordNet is one of the corpora available through the NLTK downloader (nltk.download('wordnet')). So you could just use the solution you already found without having to reinvent any wheels.

For instance, you could just do something like this to get all noun synsets:

from nltk.corpus import wordnet as wn

# Requires the WordNet data: run nltk.download('wordnet') once
for synset in wn.all_synsets('n'):
    print(synset)

# Or, equivalently
for synset in wn.all_synsets(wn.NOUN):
    print(synset)

That example gives you every noun synset in WordNet, so the nouns come grouped by sense, which helps you check that each word is being used in the intended context.

If you want to get them all into a list you can do something like the following (though this will vary quite a bit based on how you want to use the words and synsets):

all_nouns = []
for synset in wn.all_synsets('n'):
    all_nouns.extend(synset.lemma_names())

Or as a one-liner:

all_nouns = [word for synset in wn.all_synsets('n') for word in synset.lemma_names()]
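
One thing to watch with the flattened list: the same lemma can appear in several synsets, and WordNet writes multiword lemmas with underscores (for example ice_cream), so the raw list contains duplicates and phrases. Here is a minimal sketch of a cleanup step, assuming you want single, lowercase words only. The helper name clean_lemmas is made up for illustration and is not part of NLTK:

```python
def clean_lemmas(lemma_names, single_words_only=True):
    """Deduplicate lemma names and optionally drop multiword entries.

    `lemma_names` is any iterable of strings, e.g. the output of
    synset.lemma_names() flattened across synsets.
    """
    cleaned = set()
    for name in lemma_names:
        if single_words_only and "_" in name:
            continue  # WordNet joins multiword lemmas with underscores
        cleaned.add(name.lower())
    return sorted(cleaned)


if __name__ == "__main__":
    # Tiny hand-written sample standing in for WordNet output
    sample = ["dog", "domestic_dog", "Canis_familiaris", "dog", "Frank"]
    print(clean_lemmas(sample))
```

With NLTK installed you could pass it the same generator as the one-liner: clean_lemmas(word for synset in wn.all_synsets('n') for word in synset.lemma_names()). Lowercasing folds proper nouns into common ones, so skip that step if the distinction matters for your use.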

You should use the Moby Parts of Speech Project data. Don't be fixated on using only what is directly in NLTK by default. It would take little work to download the files for this, and they are pretty easy to parse once loaded.
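
A rough sketch of what the parsing could look like in plain Python. It assumes the layout I recall for the Moby part-of-speech file: one entry per line, the word and its part-of-speech codes separated by a × character, with N marking nouns and A marking adjectives. Check the README that ships with the data before relying on this. The function name, the delimiter default, and the file name in the note below are all assumptions.

```python
from collections import defaultdict


def parse_moby_pos(lines, delimiter="\u00d7"):
    """Group words by part-of-speech code.

    Assumed line format: <word><delimiter><codes>, where each
    character in <codes> is one part-of-speech code.
    """
    words_by_code = defaultdict(set)
    for line in lines:
        line = line.strip()
        if delimiter not in line:
            continue  # skip blank or malformed lines
        word, codes = line.rsplit(delimiter, 1)
        for code in codes:
            words_by_code[code].add(word)
    return words_by_code


if __name__ == "__main__":
    # Hand-written sample lines, not real Moby data
    sample = ["apple\u00d7N", "quick\u00d7A", "run\u00d7NV"]
    by_code = parse_moby_pos(sample)
    print(sorted(by_code["N"]))  # words tagged as nouns in the sample
```

For the real file you would pass an open file object instead of the sample list, for example open("mobypos.txt", encoding="latin-1"), where both the file name and the encoding are guesses to verify against your download.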

I saw a similar question earlier this week (can't find the link), but like I said then, I don't think maintaining a list of nouns/adjectives/whatever is a great idea. This is primarily because the same word can have different parts of speech, depending on the context.

However, if you are still dead set on using these lists, then here's how I would do it (I don't have a working NLTK install on this machine, but I remember the basics):

import nltk

nouns = set()
# my_corpus is a placeholder for any NLTK corpus reader
for sentence in my_corpus.sents():
    # sents() yields lists of words; with tagged_sents() you get (word, POS tag) tuples
    # and can drop the call to nltk.pos_tag
    for word, pos in nltk.pos_tag(sentence):
        if pos in ('NN', 'NNP'):  # feel free to add any other noun tags
            nouns.add(word)
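
If the corpus already ships with tags (many NLTK corpus readers expose a tagged_sents() method), the tagger call can be skipped entirely. Here is a minimal sketch of that variant, written against plain lists of (word, tag) tuples so it runs without NLTK. collect_words_by_tag is a hypothetical helper name, and the default tag set assumes Penn Treebank tags:

```python
NOUN_TAGS = frozenset({"NN", "NNS", "NNP", "NNPS"})  # Penn Treebank noun tags


def collect_words_by_tag(tagged_sents, wanted_tags=NOUN_TAGS):
    """Return the set of words whose tag is in `wanted_tags`."""
    words = set()
    for sentence in tagged_sents:
        for word, tag in sentence:
            if tag in wanted_tags:
                words.add(word)
    return words


if __name__ == "__main__":
    # Hand-tagged sample sentences standing in for a tagged corpus
    sample = [[("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
              [("Dogs", "NNS"), ("like", "VBP"), ("London", "NNP")]]
    print(sorted(collect_words_by_tag(sample)))
```

For example, nltk.corpus.treebank.tagged_sents() yields sentences in this shape once the treebank sample is downloaded. Tag names depend on the corpus (the Brown corpus uses its own tagset), so adjust wanted_tags to match.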

Hope this helps
