
Words in WordNet corpus clarification

I want to get the total number of words in the WordNet corpus:

Code:

from nltk.corpus import wordnet as wn

len_wn = len([word.lower() for word in wn.words()])
print(len_wn)

The output I get is 147306.

My Questions:

  • Am I getting the total length of words in WordNet?
  • Do tokens such as zoom_in count as words?

Am I getting the total length of words in WordNet?

It depends on your definition of "words". The wn.words() function iterates through all the lemma names; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1701 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1191:

def words(self, lang="eng"):
    """return lemmas of the given language as list of words"""
    return self.all_lemma_names(lang=lang)


def all_lemma_names(self, pos=None, lang="eng"):
    """Return all lemma names for all synsets for the given
    part of speech tag and language or languages. If pos is
    not specified, all synsets for all parts of speech will
    be used."""

    if lang == "eng":
        if pos is None:
            return iter(self._lemma_pos_offset_map)
        else:
            return (
                lemma
                for lemma in self._lemma_pos_offset_map
                if pos in self._lemma_pos_offset_map[lemma]
            )
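
Since wn.words() simply delegates to all_lemma_names(), both should yield the same lemma names. A minimal sketch to check that (assuming NLTK is installed and the wordnet corpus data has been downloaded):

from nltk.corpus import wordnet as wn

# wn.words() delegates to all_lemma_names(), so both iterators
# yield the same number of lemma names; the exact count (e.g.
# 147306) depends on the installed WordNet version.
print(len(list(wn.words())))
print(len(list(wn.all_lemma_names())))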

So if the definition of "words" is all possible lemma names, then yes, this function iterates through all the lemma names in WordNet, and your 147306 is the count of those names. If by "length" you instead mean the total number of characters across all lemma names:

>>> sum(len(lemma_name) for lemma_name in wn.words())
1692291
>>> sum(len(lemma_name.lower()) for lemma_name in wn.words())
1692291

Lowercasing is not necessary because the lemma names already come lowercased, even for named entities, e.g.

>>> 'new_york' in wn.words()
True
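
A quick sanity-check sketch (assuming a standard English WordNet install) to verify that every name yielded by wn.words() is already lowercase:

from nltk.corpus import wordnet as wn

# Lemma names in the WordNet index are stored lowercased, so
# lowercasing them again should change nothing.
assert all(word == word.lower() for word in wn.words())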

But do note that the same synset can have several very similar lemma names:

>>> 'new_york' in wn.words()
True
>>> 'new_york_city' in wn.words()
True

That's because of how WordNet is structured. The NLTK API organizes a "meaning" as a synset; a synset is linked to one or more lemmas, and each lemma comes with a name:

>>> wn.synset('new_york.n.1')
Synset('new_york.n.01')

>>> wn.synset('new_york.n.1').lemmas()
[Lemma('new_york.n.01.New_York'), Lemma('new_york.n.01.New_York_City'), Lemma('new_york.n.01.Greater_New_York')]

>>> wn.synset('new_york.n.1').lemma_names()
['New_York', 'New_York_City', 'Greater_New_York']
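
In other words, lemma_names() is just a convenience for pulling the name off each Lemma object; a small sketch of the equivalence:

from nltk.corpus import wordnet as wn

# lemma_names() returns the same names, in the same order, as
# collecting .name() from each lemma of the synset.
ss = wn.synset('new_york.n.01')
assert ss.lemma_names() == [lemma.name() for lemma in ss.lemmas()]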

But each "word" you query can have multiple synsets (i.e. multiple meanings), e.g.

>>> wn.synsets('new_york')
[Synset('new_york.n.01'), Synset('new_york.n.02'), Synset('new_york.n.03')]
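
To see what those different meanings are, you can print each synset's gloss (a small illustrative sketch; the exact definitions depend on your WordNet version):

from nltk.corpus import wordnet as wn

# Each synset of 'new_york' carries its own definition.
for synset in wn.synsets('new_york'):
    print(synset.name(), '-', synset.definition())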

Do tokens such as zoom_in count as words?

Again, it depends on the definition of a "word". As the example above shows, if you iterate through wn.words() you are iterating through the lemma names, and the new_york examples show that multi-word expressions exist in the lemma name lists for each synset.
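
If you want to know how many of those "words" are multi-word expressions, you can count the lemma names that contain an underscore (a sketch; the exact counts depend on your WordNet version):

from nltk.corpus import wordnet as wn

# Multi-word expressions join their tokens with '_',
# e.g. 'zoom_in' and 'new_york'.
mwe = [word for word in wn.words() if '_' in word]
print(len(mwe))          # number of multi-word lemma names
print('zoom_in' in mwe)  # should be True if 'zoom_in' is a lemma name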
