Cosine Similarity [Python]
I am using the following code from my function to compute the cosine similarity between a query and the documents:
import math

def rank_retrieve(self, query):
    """
    Given a query (a list of words), return a rank-ordered list of
    documents and scores for the query.
    self.docs : list of documents
    self.docs[i] : list of words in doc number i -> [word1,word2,...,wordN]
    self.boolean_retrieve(query) : given a list of words, returns the indices of
    the documents which contain all of these words.
    self.get_tfidf(word,documentIndex) : returns the tf-idf value of a word in a document
    self.get_posting(word): returns a list of document indices where that word appears
    """
    scores = [0.0 for xx in range(len(self.docs))]
    # Apply Cosine Similarity
    for i in self.boolean_retrieve(query):
        normDoc = 0.0
        normQuery = 0.0
        dt = 0.0
        qtdt = 0.0
        for word in query:
            dt = self.get_tfidf(word,i)
            normDoc+= math.pow(dt,2)
            qt = 1.0 + ( math.log10( len(query) ) )
            normQuery+=math.pow(qt,2)
            qtdt += dt * qt
        # normQuery is accumulated above but never used in the score
        scores[i] = qtdt / ( math.sqrt(normDoc) )
    return scores
This is the only material I have on the topic:
[image not preserved]
So, can you help me with my code? It returns wrong values and I don't know why. Thanks.
Result for the score of doc 56:
Cosine Similarity Test
doc 56, query : ['separ', 'of', 'church', 'and', 'state']
Separ:
QTDT: 0.105587429399
DT 0.0621479067488
QT 1.69897000434
NormDoc: 0.00386236231326 normQuery 2.88649907563
Of:
QTDT: 0.105587429399
DT 0.0
QT 1.69897000434
NormDoc: 0.00386236231326 normQuery 5.77299815127
Church :
QTDT: 0.653857934128
DT 0.322707583613
QT 1.69897000434
NormDoc: 0.108002546834 normQuery 8.6594972269
And:
QTDT: 0.653857934128
DT 0.0
QT 1.69897000434
NormDoc: 0.108002546834 normQuery 11.5459963025
State:
QTDT: 0.674927180008
DT 0.0124011876763
QT 1.69897000434
NormDoc: 0.10815633629 normQuery 14.4324953782
Scores of 56 must be 0.010676611271744128 found : 2.05225316563
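As a sanity check, the "found" value can be reproduced from the last lines of the trace: it is the accumulated QTDT divided by the square root of NormDoc alone, and normQuery never enters the result. The sketch below only reuses the numbers printed above. The second variant, which also divides by the query norm, is an assumption about what a fuller normalization might look like, not the reference solution.

```python
import math

# Final accumulated values copied from the trace for doc 56.
qtdt = 0.674927180008
norm_doc = 0.10815633629
norm_query = 14.4324953782

# What the posted function returns: only the document norm is divided out.
score_as_posted = qtdt / math.sqrt(norm_doc)

# Assumed variant: divide by the query norm as well
# (both norms here are sums over the query words only).
score_both_norms = qtdt / (math.sqrt(norm_doc) * math.sqrt(norm_query))

print(score_as_posted)   # close to the reported 2.05225316563
print(score_both_norms)
```

Neither value is near the expected 0.0107, so the reference presumably normalizes differently, for example over all terms of the document rather than just the query words. The trace does not confirm this.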
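Do you include the terms marked in red? Does the reference solution include them?
They are not needed to obtain the same ranking, but if I am not mistaken they should affect the numerical values.
Also, because of those 1+log terms, your definition of cosine similarity seems to be non-standard.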
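To illustrate the point about ranking, here is a minimal sketch with made-up weight vectors (the documents, terms, and weights are hypothetical, not taken from the question). Dividing every score by the same query norm rescales the values but leaves the order of the documents unchanged.

```python
import math

def dot(u, v):
    """Dot product over the terms the two sparse vectors share."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def norm(u):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in u.values()))

# Hypothetical weights, for illustration only.
query = {"church": 1.0, "state": 1.0}
docs = {
    "a": {"church": 0.3, "state": 0.1, "law": 0.5},
    "b": {"church": 0.2, "state": 0.2},
    "c": {"state": 0.4, "tax": 0.4},
}

# Full cosine similarity: divide by both vector lengths.
full = {d: dot(query, v) / (norm(query) * norm(v)) for d, v in docs.items()}
# Query norm omitted: every score is scaled by the same constant.
partial = {d: dot(query, v) / norm(v) for d, v in docs.items()}

rank_full = sorted(full, key=full.get, reverse=True)
rank_partial = sorted(partial, key=partial.get, reverse=True)
print(rank_full, rank_partial)
```

Note that the document norm here runs over every term of the document, not only the terms that also occur in the query.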
from math import log10, sqrt

def calctfidfvec(tfvec, withidf):
    # tfvec maps token -> raw term frequency; getidf(token) is defined elsewhere
    tfidfvec = {}
    veclen = 0.0
    for token in tfvec:
        if withidf:
            tfidf = (1+log10(tfvec[token])) * getidf(token)
        else:
            tfidf = (1+log10(tfvec[token]))
        tfidfvec[token] = tfidf
        veclen += pow(tfidf,2)
    # scale the vector to unit length
    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)
    return tfidfvec

def cosinesim(vec1, vec2):
    # both vectors are already unit length, so the dot product is the cosine
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token]*vec2[token]
    return sim
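The two helpers above could be used roughly as follows. This is a self-contained sketch of the same idea (log-scaled term frequencies, unit-length vectors, dot product over shared terms) without idf. The function names `normalized_tf` and `cosine` and the sample token lists are made up for illustration.

```python
from collections import Counter
from math import log10, sqrt

def normalized_tf(tokens):
    """Log-scaled term frequencies, scaled to unit length (no idf)."""
    weights = {t: 1 + log10(c) for t, c in Counter(tokens).items()}
    length = sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return weights  # empty input: nothing to normalize
    return {t: w / length for t, w in weights.items()}

def cosine(vec1, vec2):
    """Dot product of two unit-length sparse vectors."""
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

# Made-up token lists, for illustration only.
query = normalized_tf(["church", "state"])
doc = normalized_tf(["church", "church", "state", "law"])

print(cosine(query, doc))  # between 0 and 1
print(cosine(doc, doc))    # a vector compared with itself gives about 1.0
```

Because both vectors are normalized over all of their own terms, the result stays within [0, 1] for non-negative weights.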