
How to compare two lists and return the highest similarity of words in a list

I have a list:

list1 = ['good']

I have another list with synonyms of the word "good":

list2 = ['unspoilt', 'respectable', 'honorable', 'undecomposed', 'goodness', 'well', 'near', 'commodity', 'safe', 'dear', 'just', 'secure', 'in_force', 'practiced', 'trade_good', 'proficient', 'expert', 'good', 'sound', 'soundly', 'effective', 'in_effect', 'beneficial', 'dependable', 'unspoiled', 'estimable', 'salutary', 'adept', 'full', 'ripe','upright', 'skilful', 'right', 'serious', 'skillful', 'thoroughly','honest']

Now I want to list the word with the maximum similarity. Is that possible?

Suppose some words have a similarity to "good" greater than 0.8; then I want to return only those words in a list.

For example, suppose "unspoilt" has a similarity of around 0.9:

max_similar_list = ['unspoilt']

Here I am using a similarity ratio between 0 and 1, computed by SequenceMatcher from difflib. The word in the list with the highest ratio against the target word is taken as the most similar word. Note that this ratio compares the characters of the two words, not their meaning.

Try this code:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

list1 = ['good']
list2 = ['unspoilt', 'respectable', 'honorable', 'undecomposed', 'goodness', 'well', 'near', 'commodity', 'safe', 'dear', 'just', 'secure', 'in_force', 'practiced', 'trade_good', 'proficient', 'expert', 'good', 'sound', 'soundly', 'effective', 'in_effect', 'beneficial', 'dependable', 'unspoiled', 'estimable', 'salutary', 'adept', 'full', 'ripe','upright', 'skilful', 'right', 'serious', 'skillful', 'thoroughly','honest']
# Start from the first candidate, then keep whichever word scores higher
best_word = list2[0]
best_score = similar(list1[0], best_word)
for word in list2[1:]:
    score = similar(list1[0], word)
    if score > best_score:
        best_score = score
        best_word = word
print("The most similar word in the list is '{}' with a similarity ratio of {}".format(best_word, best_score))
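The question also asks for every word whose similarity to "good" exceeds 0.8, not only the single best one. A minimal sketch of that filter, reusing the same SequenceMatcher ratio, might look like the following. The function name words_above_threshold and the shortened candidate list are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def words_above_threshold(target, candidates, threshold=0.8):
    """Return (word, ratio) pairs whose ratio against target exceeds threshold, best first."""
    scored = [(w, SequenceMatcher(None, target, w).ratio()) for w in candidates]
    kept = [(w, r) for w, r in scored if r > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# A few words taken from list2 in the question (shortened for readability)
candidates = ['unspoilt', 'goodness', 'well', 'trade_good', 'good', 'sound']
matches = words_above_threshold('good', candidates)
print(matches)
```

With this character-level measure only "good" itself clears 0.8 ("goodness" scores about 0.67), so synonyms that are spelled differently are not picked up. That is why the question's example of "unspoilt" at 0.9 needs a meaning-based similarity instead.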


For this, you need to define some way to measure similarity between words. One way to do this is Word2Vec, which generates word embeddings. Gensim has a good implementation of Word2Vec; read more here:

https://radimrehurek.com/gensim/models/word2vec.html

For Word2Vec, you need a corpus to train the model, which then produces a vector embedding for each word. You then find the word closest to the target word using a distance function (e.g. cosine).

Here is some sample code:

# Imports
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec

# Download the Brown corpus on first use
nltk.download('brown')

# Use the Brown corpus (category 'news') from NLTK; replace it with your own corpus of suitable sentences
sentences = brown.sents(categories='news')

# Initialize and train the model
model = Word2Vec(min_count=1)
model.build_vocab(sentences)
# model.epochs replaces the older model.iter attribute, which gensim 4 removed
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

# Find the similarity between two words
print(model.wv.similarity('good', 'well'))

0.99978923463065106

PS: Here I'm comparing two words; you can also use other methods that give you the most similar word from the corpus. Be careful about words that are not in the corpus.
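To make the "closest word by cosine" step and the out-of-vocabulary caveat concrete, here is a small library-free sketch. The three-dimensional vectors in toy_vectors are made up purely for illustration and are not real Word2Vec output. The function name rank_by_cosine is likewise an assumption. With a trained model you would look up real vectors instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example only
toy_vectors = {
    'good': [1.0, 0.0, 0.0],
    'well': [0.9, 0.1, 0.0],
    'near': [0.0, 1.0, 0.0],
    'safe': [0.6, 0.0, 0.8],
}

def rank_by_cosine(target, candidates, vectors, threshold=0.8):
    """Return (word, score) pairs scoring above threshold, best first.

    Candidates without a vector (words not in the corpus) are skipped.
    """
    results = []
    for word in candidates:
        if word == target or word not in vectors:
            continue
        score = cosine(vectors[target], vectors[word])
        if score > threshold:
            results.append((word, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# 'unspoilt' has no vector here, so it is silently skipped
ranked = rank_by_cosine('good', ['well', 'near', 'safe', 'unspoilt'], toy_vectors)
print(ranked)
```

Skipping unknown words is one possible design choice; depending on the application it may be better to report them so that missing vocabulary is noticed.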
