
Use Latent Semantic Analysis with sklearn

I am trying to write a script that calculates the similarity of a few documents. I want to do it using LSA. I found the following code and changed it a bit. It takes 3 documents as input and outputs a 3x3 matrix with the pairwise similarities between them. I want to do the same similarity calculation, but using only the sklearn library. Is that possible?

from numpy import zeros
from scipy.linalg import svd
from math import log
from numpy import asarray, sum
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity

# doc1, doc2, doc3 hold the raw text of the three documents to compare
titles = [doc1, doc2, doc3]
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords.words('english')
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0        
    def parse(self, doc):
        words = doc.split()
        for w in words:
            # lowercase and strip the characters listed in ignorechars
            w = w.lower().translate(str.maketrans('', '', self.ignorechars))
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1
    def build(self):
        # keep only words that occur more than once across the corpus
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        # term-document count matrix: rows are words, columns are documents
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1
    def calc(self):
        # full SVD of the term-document matrix; columns of Vt correspond to documents
        self.U, self.S, self.Vt = svd(self.A)
        # return documents as rows so cosine_similarity compares documents
        # (the sign of singular vectors is arbitrary, so the -1 factor is dropped)
        return self.Vt.T

    def TFIDF(self):
        # optional TF-IDF weighting of the count matrix (not called below)
        WordsPerDoc = sum(self.A, axis=0)                    # word count per document
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)  # documents containing each word
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
a = mylsa.calc()
print(cosine_similarity(a))  # 3x3 matrix of pairwise document similarities

From @ogrisel's answer:

I ran the following code, but my mouth is still open :) When plain TF-IDF gives at most 80% similarity on two documents with the same subject, this code gives me 99.99%. That's why I think something is wrong :P

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(dataset)
lsa = TruncatedSVD()  # note: n_components defaults to 2
X = lsa.fit_transform(X)
X = Normalizer(copy=False).fit_transform(X)

print(cosine_similarity(X))  # pairwise document similarities in LSA space
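
A plausible explanation for the inflated similarities (my note, not part of the original answer): TruncatedSVD() keeps only n_components=2 components by default, so the three documents are projected onto a two-dimensional space and then length-normalized; in such a low-dimensional space almost any pair of documents ends up looking highly similar. Assuming a scikit-learn version where TruncatedSVD exposes explained_variance_ratio_, you can check how much information the kept components retain:

# Diagnostic sketch: lsa is the fitted TruncatedSVD from the snippet above.
# A sum close to 1.0 means 2 components already capture nearly everything;
# a low sum suggests raising n_components (it must stay below the number
# of TF-IDF features) before trusting the similarity scores.
print(lsa.explained_variance_ratio_.sum())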

You can use the TruncatedSVD transformer from sklearn 0.14+: call fit_transform on your database of documents, then call the transform method (of the same TruncatedSVD instance) on the query document. You can then compute the cosine similarity of the transformed query documents against the transformed database with sklearn.metrics.pairwise.cosine_similarity, and numpy.argsort the result to find the index of the most similar document.
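
A minimal sketch of that workflow (the document strings and the n_components value below are placeholders, not from the original answer):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

database = ["the cat sat on the mat", "the dog ate my homework", "cats and dogs are animals"]
query = ["my cat plays with the dog"]

vectorizer = TfidfVectorizer(stop_words='english')
X_db = vectorizer.fit_transform(database)              # fit the vocabulary on the database

lsa = TruncatedSVD(n_components=2)                     # placeholder component count
X_db_lsa = lsa.fit_transform(X_db)                     # fit_transform on the database
X_q_lsa = lsa.transform(vectorizer.transform(query))   # transform the query with the same models

sims = cosine_similarity(X_q_lsa, X_db_lsa)            # shape: (1, len(database))
ranking = np.argsort(sims[0])[::-1]                    # document indices, most similar first
print(ranking[0])                                      # index of the most similar document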

Note that under the hood, scikit-learn also uses NumPy, but in a more efficient way than the snippet you gave (by using the randomized SVD trick of Halko, Martinsson and Tropp).
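
For the curious, that randomized SVD is exposed directly as sklearn.utils.extmath.randomized_svd; a minimal sketch with a placeholder matrix:

import numpy as np
from sklearn.utils.extmath import randomized_svd

# Placeholder term-document matrix (rows = terms, columns = documents)
A = np.random.rand(100, 3)

# Computes only the top-k singular triplets instead of the full
# decomposition that scipy.linalg.svd performs
U, S, Vt = randomized_svd(A, n_components=2)
print(U.shape, S.shape, Vt.shape)   # (100, 2) (2,) (2, 3)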
