I want to get the number of words in the WordNet corpus.
Code:
from nltk.corpus import wordnet as wn
len_wn = len([word.lower() for word in wn.words()])
print(len_wn)
I get the output as 147306
My Questions:
Do WordNet tokens such as zoom_in count as words?

Answer:
It depends on what the definition of "words" is. The wn.words() function iterates through all the lemma_names; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191:
def words(self, lang="eng"):
    """return lemmas of the given language as list of words"""
    return self.all_lemma_names(lang=lang)

def all_lemma_names(self, pos=None, lang="eng"):
    """Return all lemma names for all synsets for the given
    part of speech tag and language or languages. If pos is
    not specified, all synsets for all parts of speech will
    be used."""
    if lang == "eng":
        if pos is None:
            return iter(self._lemma_pos_offset_map)
        else:
            return (
                lemma
                for lemma in self._lemma_pos_offset_map
                if pos in self._lemma_pos_offset_map[lemma]
            )
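So the iteration is just over the keys of an internal dict mapping each lemma name to its part-of-speech offsets. Here is a minimal stand-alone sketch of that logic; the _lemma_pos_offset_map entries below are made up for illustration (the real map is built from the WordNet data files):

```python
# Toy stand-in for the reader's _lemma_pos_offset_map:
# lemma name -> {pos tag: synset offsets}
_lemma_pos_offset_map = {
    "dog": {"n": [2084071], "v": [1525666]},
    "zoom_in": {"v": [2155493]},
    "new_york": {"n": [9117351]},
}

def words():
    # mirrors wn.words(): just iterates the lemma names (dict keys)
    return iter(_lemma_pos_offset_map)

def all_lemma_names(pos=None):
    # mirrors the pos filter in all_lemma_names()
    if pos is None:
        return iter(_lemma_pos_offset_map)
    return (l for l in _lemma_pos_offset_map if pos in _lemma_pos_offset_map[l])

print(list(words()))               # ['dog', 'zoom_in', 'new_york']
print(list(all_lemma_names("v")))  # ['dog', 'zoom_in']
```

Note that multi-word entries like zoom_in are ordinary keys in this map, which is why they show up when you iterate wn.words().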
So if the definition of "words" is all possible lemma names, then yes, your code counts the lemma names in WordNet. Summing their lengths instead gives the total number of characters across all lemma names:
>>> sum(len(lemma_name) for lemma_name in wn.words())
1692291
>>> sum(len(lemma_name.lower()) for lemma_name in wn.words())
1692291
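Note that len([...]) and sum(len(...)) measure two different things: the first counts lemma names (147306), the second counts characters across all of them (1692291). A toy illustration of the difference, with a sample list standing in for wn.words() (which needs the WordNet corpus downloaded):

```python
# Sample lemma names standing in for wn.words()
lemma_names = ["dog", "zoom_in", "new_york"]

# Count of lemma names -- what len([w.lower() for w in wn.words()]) measures
n_names = len(lemma_names)

# Total characters -- what sum(len(name) for name in wn.words()) measures
n_chars = sum(len(name) for name in lemma_names)

print(n_names, n_chars)  # 3 18
```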
Lowercasing is not necessary because lemma names already come lowercased, even named entities, e.g.
>>> 'new_york' in wn.words()
True
But do note that the same synset can have several very similar lemma names:
>>> 'new_york' in wn.words()
True
>>> 'new_york_city' in wn.words()
True
That's because of how WordNet is structured. The NLTK API organizes "meaning" as a synset; a synset is linked to multiple lemmas, and each lemma comes with at least one name:
>>> wn.synset('new_york.n.1')
Synset('new_york.n.01')
>>> wn.synset('new_york.n.1').lemmas()
[Lemma('new_york.n.01.New_York'), Lemma('new_york.n.01.New_York_City'), Lemma('new_york.n.01.Greater_New_York')]
>>> wn.synset('new_york.n.1').lemma_names()
['New_York', 'New_York_City', 'Greater_New_York']
But each "word" you query can have multiple synsets (i.e. multiple meanings), e.g.
>>> wn.synsets('new_york')
[Synset('new_york.n.01'), Synset('new_york.n.02'), Synset('new_york.n.03')]
It depends on the definition of a "word": as shown above, if you iterate through wn.words(), you are iterating through the lemma_names, and the new_york examples show that multi-word expressions exist in the lemma name lists for each synset.
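If your definition of "word" excludes multi-word expressions, one common heuristic (an assumption on my part, not something the question specifies) is to filter out lemma names containing an underscore, since WordNet joins the parts of a collocation with "_". A sketch on sample data; with the real corpus you would iterate wn.words() after nltk.download('wordnet'):

```python
# Sample lemma names standing in for wn.words()
lemma_names = ["dog", "new_york", "new_york_city", "zoom_in"]

# Keep only single-word lemma names (no underscore => not a collocation)
single_words = [name for name in lemma_names if "_" not in name]
print(single_words)  # ['dog']
```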