Converting words to Latent Semantic Analysis (LSA) vectors

Question

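Does anyone have suggestions on how to convert the words in a document into LSA vectors using Python and scikit-learn? I found these sites here and here that describe how to convert a whole document into an LSA vector, but I am interested in converting the individual words themselves.

The end goal is to sum the vectors of all the words in each sentence, and then compare consecutive sentences to assess their semantic similarity.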

Answer 1

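Turning a sentence or a word into a vector is no different from doing it for a document: a sentence is just a short document, and a word is a very, very short one. From the first link, we have this code for mapping a document to a vector: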

    def makeVector(self, wordString):
        """ @pre: unique(vectorIndex) """
        # Initialise vector with 0's
        vector = [0] * len(self.vectorKeywordIndex)
        wordList = self.parser.tokenise(wordString)
        wordList = self.parser.removeStopWords(wordList)
        for word in wordList:
            vector[self.vectorKeywordIndex[word]] += 1  # Use simple Term Count Model
        return vector

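The same function can be used to map a sentence or a single word to a vector: just pass it to this function. For a single word, wordList will be an array holding a single value, such as ["word"], and after mapping, the resulting vector is a unit vector with a 1 in the associated dimension and 0s in all the others.

Example:

vectorKeywordIndex (representing all the words in the vocabulary):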

{"hello" : 0, "world" : 1, "this" : 2, "is" : 3, "me" : 4, "answer" : 5}

Document "this is me": [0, 0, 1, 1, 1, 0]

Document "hello answer me": [1, 0, 0, 0, 1, 1]

Word "hello": [1, 0, 0, 0, 0, 0]

Word "me": [0, 0, 0, 0, 1, 0]
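The mapping in this example can be reproduced with a small standalone function. This is only a sketch: `make_vector` and `vector_keyword_index` are illustrative names, and it assumes whitespace tokenisation, lower-casing, and no stop-word removal (unlike the `makeVector` method above, which relies on a parser object that is not shown here).

```python
# Vocabulary index from the example above: word -> vector dimension.
vector_keyword_index = {"hello": 0, "world": 1, "this": 2, "is": 3, "me": 4, "answer": 5}


def make_vector(text, keyword_index):
    """Map a document, a sentence, or a single word to a term-count vector."""
    vector = [0] * len(keyword_index)
    # Assumption: plain whitespace tokenisation, no stop-word removal.
    for word in text.lower().split():
        if word in keyword_index:  # silently skip out-of-vocabulary words
            vector[keyword_index[word]] += 1
    return vector


print(make_vector("this is me", vector_keyword_index))       # [0, 0, 1, 1, 1, 0]
print(make_vector("hello answer me", vector_keyword_index))  # [1, 0, 0, 0, 1, 1]
print(make_vector("hello", vector_keyword_index))            # [1, 0, 0, 0, 0, 0]
print(make_vector("me", vector_keyword_index))               # [0, 0, 0, 0, 1, 0]
```

Skipping unknown words is a design choice made for this sketch; the original method would raise a KeyError for a word that is missing from the index.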

After that, similarity can be evaluated with one of several measures, such as cosine similarity, using the following code:

    from numpy import dot
    from numpy.linalg import norm

    def cosine(vector1, vector2):
        """ related documents j and q are in the concept space by comparing
        the vectors using the code: cosine = ( V1 * V2 ) / ||V1|| x ||V2|| """
        return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
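As a quick check of the formula, here is a plain-Python sketch applied to the example vectors from earlier. It uses only the standard library instead of NumPy, and it assumes both vectors are non-zero (otherwise the division fails).

```python
import math


def cosine(vector1, vector2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    return dot / (norm1 * norm2)


doc_a = [0, 0, 1, 1, 1, 0]    # "this is me"
doc_b = [1, 0, 0, 0, 1, 1]    # "hello answer me"
word_me = [0, 0, 0, 0, 1, 0]  # "me"

# The two documents share only "me": 1 / (sqrt(3) * sqrt(3)) = 1/3.
sim_docs = cosine(doc_a, doc_b)
# A document against one of its own words: 1 / (sqrt(3) * 1).
sim_word = cosine(doc_a, word_me)
print(sim_docs, sim_word)
```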

Or use scikit-learn's sklearn.metrics.pairwise.cosine_similarity.

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(x, y)
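Note that `cosine_similarity` expects 2D inputs of shape (n_samples, n_features). To tie this back to LSA in scikit-learn, one possible sketch is shown below. It is not taken from the linked sites: the toy corpus, the variable names, and the choice of `CountVectorizer` with `TruncatedSVD` (2 components) are all assumptions made for illustration. A single word is passed through the same pipeline as a one-word document, as described above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; a useful LSA model needs far more text.
sentences = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices fell sharply today",
    "the market dropped as prices fell",
]

# Term-count matrix, shape (n_sentences, n_terms).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

# LSA: truncated SVD of the term-count matrix (2 components is an arbitrary choice).
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(counts)
sentence_vectors = svd.transform(counts)  # shape (n_sentences, 2)

# A single word is treated as a very short document.
word_vector = svd.transform(vectorizer.transform(["cat"]))  # shape (1, 2)

# The projection is linear, so summing the word vectors of a sentence
# gives the same result as projecting the sentence's count vector.
tokens = vectorizer.build_analyzer()(sentences[0])
summed = svd.transform(vectorizer.transform(tokens)).sum(axis=0)
print(np.allclose(summed, sentence_vectors[0]))

# Cosine similarity between each pair of consecutive sentences.
consecutive = [
    cosine_similarity(sentence_vectors[i:i + 1], sentence_vectors[i + 1:i + 2])[0, 0]
    for i in range(len(sentences) - 1)
]
print(consecutive)
```

Because of that linearity, summing per-word LSA vectors and projecting whole sentences should be interchangeable in this setup, so projecting each sentence directly is likely the simpler route.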

Answer 1: 1, accepted, 2017-01-10 18:35:51
