
NLTK - WordNet: list of long words

I would like to find words in WordNet that are at least 18 characters long. I tried the following code:

from nltk.corpus import wordnet as wn
sorted(w for w in wn.synset().name() if len(w)>18)

I get the following error message:

 sorted(w for w in wn.synset().name() if len(w)>18) 

TypeError: synset() missing 1 required positional argument: 'name'

I am using Python 3.4.3.

How can I fix my code?

Use wn.all_lemma_names() to get a list of all lemmas. I believe that's all the words you'll get out of Wordnet, so there should be no need to iterate over synsets (but you could call up the synsets for each lemma if you are so inclined).

You'll probably want to sort your hits by length:

longwords = [ n for n in wn.all_lemma_names() if len(n) > 18 ]
longwords.sort(key=len, reverse=True)
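Note that the question asks for words of "at least 18" characters, while `len(n) > 18` keeps only words of 19 or more. Below is a minimal sketch of a helper that makes the threshold explicit. `long_words` and the sample list are hypothetical names used for illustration; with NLTK installed you would pass `wn.all_lemma_names()` instead of the sample.

```python
def long_words(names, min_len=18):
    """Return unique names with at least min_len characters, longest first."""
    # The set drops duplicates; sorted() gives an alphabetical base order
    hits = sorted({n for n in names if len(n) >= min_len})
    # The sort is stable, so equal-length words stay in alphabetical order
    hits.sort(key=len, reverse=True)
    return hits

# Stand-in data (assumption); replace with wn.all_lemma_names() in real use
sample = ["dog", "electroencephalograph", "counterrevolutionist", "dog"]
result = long_words(sample)
print(result)
```

Passing `min_len=19` reproduces the behaviour of the `> 18` filter above.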

Before getting to the answer, it helps to know how the WordNet interface in NLTK works; see http://www.nltk.org/howto/wordnet.html

WordNet is indexed by concepts (synsets), each of which can be represented by different words and carries semantic information. The WordNet interface in NLTK lets you search for the concepts that a word can represent, e.g.:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> for ss in wn.synsets('dog'):
...     print(ss, ss.definition())
... 
Synset('dog.n.01') a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Synset('frump.n.01') a dull unattractive unpleasant girl or woman
Synset('dog.n.03') informal term for a man
Synset('cad.n.01') someone who is morally reprehensible
Synset('frank.n.02') a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
Synset('pawl.n.01') a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
Synset('andiron.n.01') metal supports for logs in a fireplace
Synset('chase.v.01') go after with the intent to catch

To access all synsets in wordnet:

wn.all_synsets()

For each synset, there are various methods you can call to look up information about it, e.g.:

>>> ss = wn.synsets('dog')[0] # First synset for the word 'dog'
>>> ss.definition()
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
>>> ss.hypernyms()
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
>>> ss.hyponyms()
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
>>> ss.name()
'dog.n.01'
>>> ss.lemma_names() # Other words that can represent this concept.
['dog', 'domestic_dog', 'Canis_familiaris']

So you can do it with a one-liner, although it is not very readable:

sorted(ss.name() for ss in wn.all_synsets() if len(ss.name())>18)

But note that this only gives you the names used to index the synsets, not every lemma. You are also counting the POS tag and the sense number (i.e. the .s.01 in the synset name absorbefacient.s.01) when you check len(ss.name()) > 18.

So what you need is lemma_names() instead of name():
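To make the length difference concrete, here is a small sketch that splits a synset name into its parts using plain string handling. `split_synset_name` is a hypothetical helper, not part of NLTK, and it assumes the `lemma.pos.nn` layout seen in names such as `dog.n.01`.

```python
def split_synset_name(name):
    """Split 'lemma.pos.nn' into (lemma, pos, sense_number)."""
    # Split from the right so any dots inside the lemma itself are preserved
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, number

name = "absorbefacient.s.01"
lemma, pos, number = split_synset_name(name)
print(len(name), len(lemma))
```

Here the full name has 19 characters and passes a `> 18` filter, while the word itself has only 14.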

>>> from itertools import chain
>>> sorted(lemma for lemma in chain(*(ss.lemma_names() for ss in wn.all_synsets())) if len(lemma) > 18)

Alternatively, you can check the length while you collect the lemmas, before chaining and sorting them:

>>> sorted(chain(*([lemma for lemma in ss.lemma_names() if len(lemma)>18] for ss in wn.all_synsets())))

Note: By iterating through the synsets and calling lemma_names(), you will get duplicates, as well as lemma names that differ only in whether their first letter is capitalized.

And of course, you don't need to go through all that trouble, since there's a built-in function:
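A minimal sketch of one way to handle both issues, assuming case-insensitive matching is acceptable for your purpose. `unique_long_lemmas` is a hypothetical helper, and the nested lists stand in for the result of calling `ss.lemma_names()` on each synset.

```python
from itertools import chain

def unique_long_lemmas(lemma_lists, min_len=19):
    """Flatten lists of lemma names, lowercase them, and drop duplicates."""
    flat = chain.from_iterable(lemma_lists)
    # The set comprehension removes duplicates, including case-only variants
    return sorted({lemma.lower() for lemma in flat if len(lemma) >= min_len})

# Stand-in data (assumption): each inner list mimics one synset's lemma_names()
fake_synsets = [
    ["dog", "domestic_dog", "Canis_familiaris"],
    ["Electroencephalograph", "electroencephalograph"],
    ["counterrevolutionist"],
]
unique = unique_long_lemmas(fake_synsets)
print(unique)
```

With NLTK available, the same function could be called as `unique_long_lemmas(ss.lemma_names() for ss in wn.all_synsets())`. Lowercasing discards the capitalization of proper nouns, so skip that step if the distinction matters to you.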

>>> sorted(lemma for lemma in wn.all_lemma_names() if len(lemma) > 18)

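The synset() function requires an argument: a synset name such as 'dog.n.01'. That is what the TypeError is telling you. Note, however, that name() returns a single string, so looping over it yields individual characters and the length filter never matches. To filter the lemma names of one synset, use lemma_names() instead:

sorted(w for w in wn.synset('dog.n.01').lemma_names() if len(w) > 18)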

http://www.nltk.org/howto/wordnet.html
