Calculate cosine similarity of all possible text pairs retrieved from 4 MySQL tables

I have 4 tables with the schema (app, text_id, title, text). Now I want to calculate the cosine similarity between all possible pairs of texts (title and text concatenated together) and eventually store them in a CSV file with the fields (app1, app2, text_id1, text1, text_id2, text2, cosine_similarity).

Since there are a lot of possible combinations, this should run efficiently. What is the most common approach here? I would appreciate any pointers.

Edit: While the provided reference may solve my problem, I still don't know how to approach this. Could someone give more details on a strategy for accomplishing this task? Apart from the computed cosine similarities, I also need the corresponding text pairs as output.

Below is a minimal example for calculating the pairwise cosine similarities of a set of documents (assuming you have already retrieved the titles and texts from the database).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume that this is the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
vec = TfidfVectorizer()
# `X` is a TF-IDF representation of the data; its first row corresponds to the first sentence in `data`
X = vec.fit_transform(data)

# Calculate the pairwise cosine similarities
# (depending on how much data you have, this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:
array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities of the first document to every other element in `X`.
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
Obviously, the similarity of every sentence/document to itself is 1 (hence the diagonal of the similarity matrix is all ones).
Since all indices are consistent, it is straightforward to map each similarity back to the corresponding pair of sentences.
'''
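
To cover the rest of the question (pulling the rows out of the four tables and writing every text pair together with its similarity to a CSV file with the fields app1, app2, text_id1, text1, text_id2, text2, cosine_similarity), here is a minimal sketch of the full pipeline. The table names table1 through table4, the connection parameters, and the use of mysql-connector-python are assumptions; adjust them to your actual setup.

import csv
from itertools import combinations

import mysql.connector  # assumption: the mysql-connector-python package is used
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical connection parameters and table names -- replace with your own
conn = mysql.connector.connect(host='localhost', user='user',
                               password='password', database='mydb')
cursor = conn.cursor()

rows = []  # each entry: (app, text_id, title concatenated with text)
for table in ('table1', 'table2', 'table3', 'table4'):
    cursor.execute("SELECT app, text_id, title, text FROM " + table)
    for app, text_id, title, text in cursor.fetchall():
        rows.append((app, text_id, '{} {}'.format(title, text)))

conn.close()

# Vectorise all documents at once so the row indices of `X` line up with `rows`
vec = TfidfVectorizer()
X = vec.fit_transform(doc for _, _, doc in rows)
S = cosine_similarity(X)

# Write every unordered pair (i, j) with i < j to the CSV file
with open('similarities.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['app1', 'app2', 'text_id1', 'text1',
                     'text_id2', 'text2', 'cosine_similarity'])
    for i, j in combinations(range(len(rows)), 2):
        app1, id1, doc1 = rows[i]
        app2, id2, doc2 = rows[j]
        writer.writerow([app1, app2, id1, doc1, id2, doc2, S[i, j]])

Note that cosine_similarity builds the full dense N x N matrix, which will not fit in memory for a very large number of texts. In that case you could compute the similarities block-wise, e.g. with sklearn.metrics.pairwise_distances_chunked(X, metric='cosine') (which yields chunks of cosine distances, i.e. 1 - similarity), and only write out the pairs whose similarity exceeds some threshold.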
