I found a Python tutorial on the web for calculating tf-idf and cosine similarity. I am trying to play with it and change it a bit.
The problem is that I get strange results that make almost no sense.
For example, I am using 3 documents: [doc1, doc2, doc3].
doc1 and doc2 are similar, and doc3 is totally different.
The results are here:
[[ 0.00000000e+00 2.20351188e-01 9.04357868e-01]
[ 2.20351188e-01 -2.22044605e-16 8.82546765e-01]
[ 9.04357868e-01 8.82546765e-01 -2.22044605e-16]]
First, I expected the numbers on the main diagonal to be 1, not 0. Second, the score for doc1 and doc2 is around 0.22, and for doc1 and doc3 around 0.90. I expected the opposite. Could you please check my code and help me understand why I get these results?
doc1, doc2 and doc3 are tokenized texts.
    import math
    import numpy
    import nltk.cluster.util

    articles = [doc1,doc2,doc3]
    corpus = []
    for article in articles:
        for word in article:
            corpus.append(word)

    def freq(word, article):
        return article.count(word)

    def wordCount(article):
        return len(article)

    def numDocsContaining(word,articles):
        count = 0
        for article in articles:
            if word in article:
                count += 1
        return count

    def tf(word, article):
        return (freq(word,article) / float(wordCount(article)))

    def idf(word, articles):
        return math.log(len(articles) / (1 + numDocsContaining(word,articles)))

    def tfidf(word, document, documentList):
        return (tf(word,document) * idf(word,documentList))

    feature_vectors=[]
    for article in articles:
        vec=[]
        for word in corpus:
            if word in article:
                vec.append(tfidf(word, article, corpus))
            else:
                vec.append(0)
        feature_vectors.append(vec)

    n=len(articles)
    mat = numpy.empty((n, n))
    for i in xrange(0,n):
        for j in xrange(0,n):
            mat[i][j] = nltk.cluster.util.cosine_distance(feature_vectors[i],feature_vectors[j])
    print mat
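For comparison, here is a minimal sketch of the same hand-rolled tf-idf written for Python 3. It makes two assumptions that differ from the code above: the vocabulary is the set of unique words (so each word gets exactly one vector slot), and `idf` receives the list of documents rather than the flat word list. The toy documents and the names `vocabulary`, `num_docs_containing` and `tfidf_vector` are made up for illustration.

```python
import math

# Toy tokenized documents (hypothetical stand-ins for doc1, doc2, doc3).
doc1 = ["cat", "sat", "on", "the", "mat"]
doc2 = ["cat", "lay", "on", "the", "mat"]
doc3 = ["stock", "prices", "fell", "sharply"]
articles = [doc1, doc2, doc3]

# One vector slot per unique word, in a stable order.
vocabulary = sorted({word for article in articles for word in article})

def tf(word, article):
    # Relative frequency of the word inside one document.
    return article.count(word) / len(article)

def num_docs_containing(word, articles):
    return sum(1 for article in articles if word in article)

def idf(word, articles):
    # Same smoothing as the question: log(N / (1 + df)), with true division.
    return math.log(len(articles) / (1 + num_docs_containing(word, articles)))

def tfidf_vector(article, articles):
    # idf is computed against the list of documents, not the word list.
    return [tf(word, article) * idf(word, articles) for word in vocabulary]

feature_vectors = [tfidf_vector(article, articles) for article in articles]
```

Note that with only three documents this smoothed idf gives a weight of exactly zero to any word found in two of them, since log(3 / (1 + 2)) = 0. On a corpus this small, words shared by two documents therefore contribute nothing to their similarity.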
If you can use another package such as sklearn, try it.
This code might help:
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def read_doc(path):
        # Read a file as UTF-8 text, skipping bytes that cannot be decoded.
        with open(path, encoding="utf-8", errors="ignore") as f:
            return f.read()

    doc1 = read_doc("/root/Myfolder/scoringDocuments/doc1")
    doc2 = read_doc("/root/Myfolder/scoringDocuments/doc2")
    doc3 = read_doc("/root/Myfolder/scoringDocuments/doc3")

    train_set = [doc1, doc2, doc3]
    test_set = ["age salman khan wife"]  # Query

    stopWords = stopwords.words('english')
    tfidf_vectorizer = TfidfVectorizer(stop_words=stopWords)

    # Fit on the query, so the vocabulary is limited to the query terms.
    tfidf_matrix_test = tfidf_vectorizer.fit_transform(test_set)
    print(tfidf_vectorizer.vocabulary_)

    # Project the documents onto that vocabulary (tf-idf scores with normalization).
    tfidf_matrix_train = tfidf_vectorizer.transform(train_set)
    print('Train set transformed with the fitted vectorizer', tfidf_matrix_train.todense())
    print('Vectorizer fitted on the test set (query)', tfidf_matrix_test.todense())

    # One cosine similarity score per document, relative to the query.
    print("\n\ncosine similarity of the query against each document ==> ",
          cosine_similarity(tfidf_matrix_test, tfidf_matrix_train))
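If the goal is the document-to-document matrix from the question rather than a query score, a minimal sketch along the same lines could fit the vectorizer on the documents themselves. The three short strings are made-up placeholders for doc1, doc2 and doc3, and `stop_words="english"` uses scikit-learn's built-in list instead of NLTK's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents: the first two share a topic, the third does not.
docs = [
    "the cat sat on the mat with another cat",
    "a cat lay on the mat near the door",
    "stock prices fell sharply after the earnings report",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # one L2-normalized row per document

# Passing a single matrix compares every row with every other row.
similarity = cosine_similarity(tfidf)    # 3 x 3 matrix
print(similarity)
```

Here the diagonal is 1 (each document compared with itself), and the first two documents should score higher against each other than against the third.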
Refer to this tutorial series: part I, part II, part III. It may help.
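One thing worth checking in the question's output: `nltk.cluster.util.cosine_distance` returns 1 minus the cosine similarity, so a zero diagonal and a small doc1/doc2 value are what similar documents should produce. A minimal NumPy sketch of the relationship, using the matrix posted in the question (the helper functions are written here for illustration and only mirror that definition):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_distance(u, v):
    # Distance form: 0 for vectors pointing the same way, 1 for orthogonal ones.
    return 1.0 - cosine_similarity(u, v)

# The matrix printed in the question, read as distances.
distances = np.array([
    [0.00000000e+00, 2.20351188e-01, 9.04357868e-01],
    [2.20351188e-01, -2.22044605e-16, 8.82546765e-01],
    [9.04357868e-01, 8.82546765e-01, -2.22044605e-16],
])

# Converting back to similarities puts 1 on the diagonal.
similarities = 1.0 - distances
print(np.round(similarities, 3))
```

Read as similarities, doc1 and doc2 score about 0.78, while doc3 scores about 0.10 and 0.12 against them, which matches the expectation that doc1 and doc2 are the similar pair.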