
Sklearn cosine similarity for strings, Python

I am writing an algorithm that checks how similar one string is to another. I am using Sklearn cosine similarity.

My code is:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(example_1)
result_cos = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(result_cos[0][1])

Running this code for example_1 prints 0.336096927276. Running it for example_2 prints the same score. The result is the same in both cases because each pair of strings differs in exactly one word.
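The identical scores can be checked by hand. The following is a minimal sketch, assuming TfidfVectorizer's default settings (tokens shorter than two characters, such as "I", are dropped; IDF is smoothed; vectors are L2-normalized). The variable names are illustrative only. After tokenization each string contributes two words: the shared "am" and one word that appears in only that string.

```python
import math

n_docs = 2  # each example tuple is a two-document corpus

# Smoothed IDF, as assumed above: ln((1 + n) / (1 + df)) + 1
idf_shared = math.log((1 + n_docs) / (1 + 2)) + 1  # "am" occurs in both strings
idf_unique = math.log((1 + n_docs) / (1 + 1)) + 1  # "okey", "okeu" or "crazy" occurs in one

# The two vectors overlap only on the shared word, so the dot product
# has a single term; both vectors have the same length.
dot = idf_shared ** 2
norm_squared = idf_shared ** 2 + idf_unique ** 2

score = dot / norm_squared
print(round(score, 6))  # 0.336097
```

Nothing in this calculation depends on the spelling of the unshared word, which is why both examples get the same score.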

What I want is a higher score for example_1, because the differing words "okey" and "okeu" differ by only one letter. In contrast, example_2 contains two completely different words, "okey" and "crazy".

How can my code take into consideration that in some cases the differing words are not completely different?

For short strings, Levenshtein distance will probably yield better results than cosine similarity based on words. The algorithm below is adapted from Wikibooks; it divides the edit distance by the length of the longer string. Since this is a distance metric, a smaller score is better.

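def levenshtein(s1, s2):
    # Make s1 the longer string so the row buffers stay as short as possible.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # Two empty strings are identical; this also avoids dividing by zero below.
    if len(s1) == 0:
        return 0.0

    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    # Normalize by the longer length: 0.0 means identical, 1.0 completely different.
    return previous_row[-1] / float(len(s1))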

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

print(levenshtein(*example_1))  # 0.1111111111111111 (1 edit / 9 characters)
print(levenshtein(*example_2))  # 0.4 (4 edits / 10 characters)
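
If a cosine-style similarity (higher is better) is still preferred, one option, not part of the answer above, is to compare character n-grams instead of whole words, so that "okey" and "okeu" share most of their features. The sketch below uses plain character-bigram counts from the standard library rather than TF-IDF weights; `char_ngrams` and `ngram_cosine` are hypothetical helper names. In scikit-learn, TfidfVectorizer's `analyzer="char"` or `analyzer="char_wb"` together with `ngram_range` should give a comparable effect, though the exact scores will differ.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_cosine(a, b, n=2):
    """Cosine similarity between the character n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for n-grams that are missing from vb.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

sim_1 = ngram_cosine(*example_1)
sim_2 = ngram_cosine(*example_2)
print(sim_1)
print(sim_2)
```

With these inputs example_1 scores about 0.875 and example_2 about 0.471, so the near-miss spelling is rewarded.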
