Sklearn cosine similarity for strings, Python

Question

I am writing an algorithm that measures how similar one string is to another. I am using Sklearn cosine similarity.

My code is:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(example_1)
result_cos = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(result_cos[0][1])

Running this code for example_1 prints 0.336096927276. Running it for example_2 prints the same score. The result is the same in both cases because each pair differs by exactly one word.

What I want is a higher score for example_1, because the differing words "okey" and "okeu" differ by only one letter. In contrast, example_2 contains two completely different words, "okey" and "crazy".

How can my code take into account that in some cases the differing words are not completely different?
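A minimal sketch of why the two scores coincide, using only the standard library. It assumes TfidfVectorizer's default settings: lowercasing, a token pattern that keeps only word tokens of two or more characters (so "I" is dropped), smoothed idf, and L2 normalisation. The helper names `tokens` and `tfidf_cosine` are hypothetical and only serve the illustration.

```python
import math
import re

def tokens(text):
    # Assumed default tokenisation: lowercase, word tokens of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_cosine(doc_a, doc_b):
    docs = [tokens(doc_a), tokens(doc_b)]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {
        t: math.log((1 + n_docs) / (1 + sum(t in d for d in docs))) + 1
        for t in vocab
    }
    vecs = [[d.count(t) * idf[t] for t in vocab] for d in docs]
    dot = sum(a * b for a, b in zip(*vecs))
    norms = [math.sqrt(sum(x * x for x in v)) for v in vecs]
    return dot / (norms[0] * norms[1])

print(tokens("I am okey"))                       # ['am', 'okey']
print(tfidf_cosine("I am okey", "I am okeu"))    # about 0.336097
print(tfidf_cosine("I am okey", "I am crazy"))   # same value
```

Each document reduces to one shared token ("am") and one unique token, and a word-level vectoriser treats "okeu" and "crazy" as equally unrelated to "okey", so both pairs get the same cosine.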

Answer 1

For short strings, Levenshtein distance will probably yield better results than word-based cosine similarity. The algorithm below is adapted from Wikibooks. Since this is a distance metric, a smaller score is better.

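def levenshtein(s1, s2):
    # Make s1 the longer string so each row has len(s2) + 1 entries.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # Distance to an empty string is the length of the other string.
    if len(s2) == 0:
        return len(s1)

    # previous_row[j] is the distance between the s1 prefix seen so far
    # and the first j characters of s2.
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            # (c1 != c2) adds 1 only when the characters differ.
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    # Normalise the edit distance by the length of the longer string.
    return previous_row[-1]/float(len(s1))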

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

print(levenshtein(*example_1))
print(levenshtein(*example_2))
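If a cosine-style similarity is still preferred, one option is to compare character n-grams instead of whole words. TfidfVectorizer has an `analyzer` parameter that accepts "char" and "char_wb" together with `ngram_range`; the sketch below does not use it and instead shows the underlying idea with plain bigram counts and the standard library only. The helpers `char_ngrams` and `char_cosine` are hypothetical names, and no idf weighting is applied, so the numbers will differ from what Sklearn would return.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    # Count every run of n consecutive characters in the string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_cosine(a, b, n=2):
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Counter returns 0 for missing n-grams, so only shared ones contribute.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(char_cosine("I am okey", "I am okeu"))   # about 0.875
print(char_cosine("I am okey", "I am crazy"))  # about 0.471
```

Here a higher score means more similar, and example_1 now scores higher than example_2 because "okey" and "okeu" share most of their bigrams.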

Answer 1
2 2017-12-09 12:46:53
