
Clustering synonym words using NLTK and Wordnet

Given a set of words V, I would like to group the words in V that are synonyms of each other. Is there a built-in function in NLTK or WordNet that takes V as input and automatically clusters the words by synonymy?

I already know how to extract the synonyms of each word, but that is not what I am looking for. With that approach, things get complicated when the synonym sets intersect or are subsets/supersets of one another, because I would then have to write a function that resolves the conflicts.

As an example, consider:

V = ["good","constipate","bad","nice","defective","right","respectable","powerful"]

What I want to get as output is:

[('constipate',), ('nice',), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]

Depending on the desired size or number of clusters, some sets might be split into several sets or merged together. Here, I only care about the words in V and their synonyms within V.

Yes, there is a way to do this using nltk and wordnet. The following example uses the built-in synsets to look up synonyms for 'book':

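import nltk
from nltk.corpus import wordnet

# Run nltk.download('wordnet') once if the WordNet corpus is not installed.

synonyms = []

# Each synset is one sense of 'book'; collect the lemma names of every sense.
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())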

The resulting synonyms for 'book' are:

print(synonyms)
>>['book', 'book', 'volume', 'record', 'record_book', 'book', 'script', 'book', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'book', 'book', 'book', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', ..]

The length of the list is:

len(synonyms)
>>38

Note: Some of these synonyms are verb forms, and many entries are repeats of 'book' coming from its different senses. Taking the set of synonyms leaves fewer unique words:

len(set(synonyms))
>>25

The resulting set is:

{'record', 'Quran', 'Holy_Scripture', 'Koran', 'Good_Book', 'playscript', 'book', 'Word_of_God', 'hold', 'Holy_Writ', 'script', 'leger', 'book_of_account', 'Scripture', 'ledger', 'reserve', 'volume', 'record_book', "al-Qur'an", 'Christian_Bible', 'Word', 'rule_book', 'Bible', 'Book', 'account_book'}
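
The lookup above returns synonyms for a single word; it does not group the words of V. As far as I know, NLTK has no single built-in call that does this, so here is a minimal sketch of one possible approach: treat two words of V as linked when one appears among the other's WordNet lemma names, then merge linked words with a small union-find. The helper names wordnet_synonyms and cluster_synonyms are my own, not part of NLTK, and the transitive merging is an assumption about what "cluster" should mean. It handles intersecting and nested synonym sets automatically, but it can also chain loosely related words into one large group.

```python
from collections import defaultdict


def wordnet_synonyms(word):
    """Return all lemma names WordNet lists for `word` (hypothetical helper).

    Requires nltk and the WordNet corpus; imported lazily so the
    clustering function below can also be used with another lookup.
    """
    from nltk.corpus import wordnet
    return {lemma.name() for syn in wordnet.synsets(word) for lemma in syn.lemmas()}


def cluster_synonyms(words, synonyms_of=wordnet_synonyms):
    """Group `words` so that words linked by synonymy end up together.

    Only synonyms that are themselves in `words` are considered.
    Linking is transitive: if a~b and b~c, all three share a cluster.
    """
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    vocab = set(words)
    for w in words:
        # Keep only the synonyms that also occur in the input vocabulary.
        for s in set(synonyms_of(w)) & vocab:
            union(w, s)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)

    # Smallest clusters first, alphabetical within equal sizes.
    return sorted((tuple(sorted(g)) for g in groups.values()),
                  key=lambda g: (len(g), g))
```

Calling cluster_synonyms(V) with the V from the question uses WordNet for the lookup; the exact clusters depend on the installed WordNet version, so they may not match the desired output in the question exactly. To restrict which links count, for example to a single part of speech, pass a different synonyms_of function.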
