
Cosine Similarity [Python]

I use the following function to compute the cosine similarity between a query and the documents:

import math

def rank_retrieve(self, query):
    """
    Given a query (a list of words), return a rank-ordered list of
    documents and score for the query.
    self.docs : list of documents
    self.docs[i] : list of words in doc number i -> [word1,word2,...,wordN]
    self.boolean_retrieve(query) : given a list of words, returns the indexes of
    the documents which contain all of these words.
    self.get_tfidf(word,documentIndex) : returns the tf-idf value of a word in a document
    self.get_posting(word): returns a list of document indexes where that word appears
    """
    scores = [0.0 for xx in range(len(self.docs))]

    # Apply Cosine Similarity
    for i in self.boolean_retrieve(query):
        normDoc = 0.0    # sum of squared document weights
        normQuery = 0.0  # sum of squared query weights
        dt = 0.0
        qtdt = 0.0       # dot product of query and document weights
        for word in query:
            dt = self.get_tfidf(word, i)
            normDoc += math.pow(dt, 2)

            qt = 1.0 + math.log10(len(query))
            normQuery += math.pow(qt, 2)

            qtdt += dt * qt
        scores[i] = qtdt / math.sqrt(normDoc)

    return scores

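I only have the following material on the topic: [image]

So, can you help me with my code? It returns wrong values and I don't know why. Thanks.

Result for the score of doc 56: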

Cosine Similarity Test
doc 56, query :   ['separ', 'of', 'church', 'and', 'state']

Separ: 
QTDT:  0.105587429399 
DT 0.0621479067488 
QT 1.69897000434 
NormDoc:  0.00386236231326  normQuery 2.88649907563

Of:
QTDT:  0.105587429399 
DT 0.0 
QT 1.69897000434 
NormDoc:  0.00386236231326  normQuery 5.77299815127

Church :
QTDT:  0.653857934128 
DT 0.322707583613 
QT 1.69897000434 
NormDoc:  0.108002546834  normQuery 8.6594972269

And:
QTDT:  0.653857934128 
DT 0.0 
QT 1.69897000434 
NormDoc:  0.108002546834  normQuery 11.5459963025

State:
QTDT:  0.674927180008 
DT 0.0124011876763 
QT 1.69897000434 
NormDoc:  0.10815633629  normQuery 14.4324953782

Scores of 56 must be 0.010676611271744128 found : 2.05225316563

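Are you including the red terms? Does the reference solution include them?

They are not necessary for getting the same ranking, but if I'm not mistaken they should affect the numerical values.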
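
To make this concrete, here is a small check that uses only the numbers printed in the question for doc 56. It assumes the red terms are the vector-length normalization factors (the image is not available, so that is a guess). The variable names are mine, not the asker's.

```python
import math

# Values copied from the debug output for doc 56 in the question.
qt = 1.69897000434  # query weight, identical for all five terms
dts = [0.0621479067488, 0.0, 0.322707583613, 0.0, 0.0124011876763]

dot = sum(qt * dt for dt in dts)                  # "QTDT" in the log
norm_doc = math.sqrt(sum(dt * dt for dt in dts))  # square root of "NormDoc"
norm_query = math.sqrt(len(dts) * qt * qt)        # square root of "normQuery"

score_doc_only = dot / norm_doc              # what the posted function computes
score_full = dot / (norm_doc * norm_query)   # query norm included as well

print(score_doc_only)  # about 2.05, the value reported as "found"
print(score_full)      # about 0.54
```

For a fixed query the query norm is the same constant for every document, so dividing by it rescales all scores without reordering them. Neither number equals the expected 0.0107, which suggests the reference solution also weights the terms differently; that part cannot be determined from the post alone.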

Also, your definition of cosine similarity seems non-standard because of the 1+log terms.

https://zh.wikipedia.org/wiki/余弦相似度
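
For comparison, a minimal sketch of the textbook definition, the dot product divided by the product of both vector lengths, could look like this. The function name, the dict-based sparse representation, and the toy vectors are illustrative choices, not taken from the question.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * vec2[t] for t, w in vec1.items() if t in vec2)
    norm1 = math.sqrt(sum(w * w for w in vec1.values()))
    norm2 = math.sqrt(sum(w * w for w in vec2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm1 * norm2)

query = {"church": 1.0, "state": 1.0}
doc_a = {"church": 2.0, "state": 2.0}  # same direction as the query
doc_b = {"garden": 3.0}                # no terms shared with the query

print(cosine_similarity(query, doc_a))  # close to 1.0
print(cosine_similarity(query, doc_b))  # 0.0
```

With non-negative weights the result always lies between 0 and 1, so a score above 1 is a sign that one of the two norms is missing from the denominator.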

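from math import log10, sqrt

def calctfidfvec(tfvec, withidf):
    # tfvec maps token -> raw term frequency.
    # getidf(token) is assumed to be defined elsewhere (not shown here).
    tfidfvec = {}
    veclen = 0.0  # running sum of squared weights

    for token in tfvec:
        if withidf:
            tfidf = (1+log10(tfvec[token])) * getidf(token)
        else:
            tfidf = (1+log10(tfvec[token]))
        tfidfvec[token] = tfidf
        veclen += pow(tfidf,2)

    # Scale to unit length so that a plain dot product gives the cosine.
    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)

    return tfidfvec

def cosinesim(vec1, vec2):
    # Both vectors are expected to be normalized by calctfidfvec already.
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token]*vec2[token]

    return sim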
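
The idea behind the code above is to normalize each vector once, after which the similarity is just a dot product over the shared terms. A compact, self-contained sketch of the same idea follows. It skips the idf factor because getidf is not shown, and the helper names and sample tokens are hypothetical.

```python
from collections import Counter
from math import log10, sqrt

def unit_log_tf(tokens):
    """Weight each term as 1 + log10(count), then scale the vector to length 1."""
    weights = {t: 1.0 + log10(c) for t, c in Counter(tokens).items()}
    length = sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length > 0 else weights

def dot(vec1, vec2):
    """Dot product over shared terms; equals the cosine for unit-length vectors."""
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

query = unit_log_tf(["church", "state"])
doc = unit_log_tf(["church", "church", "state", "law"])

print(dot(query, doc))  # strictly between 0 and 1
print(dot(doc, doc))    # a unit vector dotted with itself is close to 1.0
```

Normalizing up front means the document vectors can be computed once and reused for every query.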
