
Finding the cosine similarity between two strings (names)

I am using Python and scikit-learn to find the cosine similarity between two strings (specifically, names). The program finds the similarity score between two strings, but when the names are abbreviated it gives undesirable output.

E.g. String1 = "K KAPOOR", String2 = "L KAPOOR". The cosine similarity score of these strings is 1 (the maximum), even though the two strings are entirely different names. Is there a way to modify it in order to get the desired result?

My code is:

# -*- coding: utf-8 -*-
"""
Created on Wed Sep  9 14:40:21 2015

@author: gauge
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
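documents = ("K KAPOOR", "L KAPOOR")

# Build TF-IDF vectors for the two names
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# print(tfidf_matrix.shape)

# Cosine similarity of the first name against both names
cs = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(cs)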

As mentioned in the other answer, the cosine similarity is 1 because the two strings have exactly the same representation.

That means that this code:

tfidf_vectorizer=TfidfVectorizer()
tfidf_matrix=tfidf_vectorizer.fit_transform(documents)

produces, well:

print(tfidf_matrix.toarray())
[[ 1.]
 [ 1.]]

This means that the two strings/documents (here the rows in the array) have the same representation.

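That is because TfidfVectorizer tokenizes your documents into word tokens and, by default, keeps only tokens with at least 2 characters. The single letters "K" and "L" are therefore dropped, and each name is reduced to the single token "KAPOOR".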
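You can check this yourself by asking the vectorizer for its analyzer and running it on the two names. This is only a minimal sketch, assuming a reasonably recent scikit-learn with default settings; the variable names `tokens_k` and `tokens_l` are mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable the vectorizer uses to
# lowercase and tokenize each document (default settings here).
analyzer = TfidfVectorizer().build_analyzer()

tokens_k = analyzer("K KAPOOR")
tokens_l = analyzer("L KAPOOR")

print(tokens_k)  # ['kapoor']
print(tokens_l)  # ['kapoor']
```

Both names end up as the same one-token list, which is why their vectors are identical.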

So you could do one of the following:

  1. Use:

     tfidf_vectorizer=TfidfVectorizer(analyzer="char")

     to get character n-grams instead of word n-grams.

  2. Change the token pattern so that it keeps one-letter tokens:

     tfidf_vectorizer=TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')

     This is just a simple modification of the default pattern you can see in the documentation. Note that I had to escape the \b occurrences in the regular expression, as I was getting an 'empty vocabulary' error (in a non-raw Python string, \b is a backspace character rather than a word boundary).
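To compare the two options against the default, something like the following should work. It is a rough sketch rather than a recommendation: `name_similarity` is a helper name I made up for illustration, and the token pattern is written as a raw string, which should be equivalent to escaping the backslashes by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ("K KAPOOR", "L KAPOOR")


def name_similarity(vectorizer, docs):
    """Fit the vectorizer on docs and return the cosine similarity
    between the first and the second document (hypothetical helper)."""
    matrix = vectorizer.fit_transform(docs)
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0, 0])


# Default settings: one-letter tokens are dropped.
sim_default = name_similarity(TfidfVectorizer(), documents)

# Option 1: character-level features.
sim_char = name_similarity(TfidfVectorizer(analyzer="char"), documents)

# Option 2: word tokens, but one-letter tokens are kept.
sim_word = name_similarity(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), documents
)

print(sim_default)  # 1 (up to rounding): the vectors are identical
print(sim_char)     # below 1
print(sim_word)     # below 1
```

With either option the initials contribute to the vectors, so the score drops below 1. How far it drops depends on the option: the character-level score stays fairly high because most letters are shared, while the word-level score is lower because only one of the two tokens matches.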

Hope this helps.

String1 ="K KAPOOR", String2="L KAPOOR" The cosine similarity score of these strings is 1 (maximum) while the two strings are entirely different names. Is there a way to modify it, in order to get some desired results.

It depends. You are facing this issue because the vector representations of these two strings are exactly the same.

The cosine similarity between the two strings is 1 because they are the same: not because they are the same strings, but because they are represented by the same vector.
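To make that concrete, here is a small hand-rolled cosine similarity in plain Python. It is only an illustration: the function `cosine` and the example vectors are my own, with the three-dimensional vectors standing for raw token counts over a hypothetical vocabulary (k, kapoor, l), not the actual TF-IDF weights.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Both names collapse to the single token "kapoor": same vector.
same = cosine([1.0], [1.0])

# Hypothetical counts over the vocabulary (k, kapoor, l):
# "K KAPOOR" -> [1, 1, 0], "L KAPOOR" -> [0, 1, 1]
different = cosine([1.0, 1.0, 0.0], [0.0, 1.0, 1.0])

print(same)       # 1.0
print(different)  # approximately 0.5
```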

If you want them to be different, then you need to represent them differently. To do that, you need to train your algorithm on a corpus with enough words that occur multiple times.

Also, it is highly likely that both strings are converted to something like 'KAPOOR' during preprocessing.
