
Transforming words into Latent Semantic Analysis (LSA) Vectors

Does anyone have any suggestions for how to turn words from a document into LSA vectors using Python and scikit-learn? I found these sites here and here that describe how to turn a whole document into an LSA vector, but I am interested in converting the individual words themselves.

The end result is to sum all the vectors (representing each word) from every sentence and then compare consecutive sentences to assess semantic similarity.

Turning a sentence or a word into a vector is no different from doing so with a document: a sentence is just like a short document, and a word is like a very, very short one. From the first link we have the code for mapping a document to a vector:

    def makeVector(self, wordString):
        """ @pre: unique(vectorIndex) """
        # Initialise vector with 0's
        vector = [0] * len(self.vectorKeywordIndex)
        wordList = self.parser.tokenise(wordString)
        wordList = self.parser.removeStopWords(wordList)
        for word in wordList:
            vector[self.vectorKeywordIndex[word]] += 1  # Use simple Term Count Model
        return vector
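The makeVector method above depends on a parser object from that tutorial which is not shown here. As a rough, self-contained sketch of the same idea (assuming plain whitespace tokenisation and a hypothetical stop-word set in place of the tutorial's parser), it could be written as:

    def make_vector(word_string, vector_keyword_index, stop_words=frozenset()):
        """Map any string (document, sentence, or single word) to a term-count vector."""
        # One slot per vocabulary word, initialised to 0
        vector = [0] * len(vector_keyword_index)
        # Naive whitespace tokenisation stands in for parser.tokenise / removeStopWords
        words = [w for w in word_string.lower().split() if w not in stop_words]
        for word in words:
            vector[vector_keyword_index[word]] += 1  # simple term-count model
        return vector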

The same function can be used to map a sentence or a single word to a vector; just pass them to this function. For a word, wordList would be a list holding a single value, something like ["word"], and after mapping, the resulting vector would be a unit vector containing a 1 in the associated dimension and 0s elsewhere.

Example:

vectorKeywordIndex (representing all words in the vocabulary):

{"hello" : 0, "world" : 1, "this" : 2, "is" : 3, "me" : 4, "answer" : 5}

document "this is me" : [0, 0, 1, 1, 1, 0] 文件"this is me"[0, 0, 1, 1, 1, 0]

document "hello answer me" : [1, 0, 0, 0, 1, 1] 文档"hello answer me"[1, 0, 0, 0, 1, 1] "hello answer me" [1, 0, 0, 0, 1, 1]

word "hello" : [1, 0, 0, 0, 0, 0] 单词"hello"[1, 0, 0, 0, 0, 0]

word "me" : [0, 0, 0, 0, 1, 0] 单词"me"[0, 0, 0, 0, 1, 0]

After that, similarity can be assessed through several criteria, such as cosine similarity, using this code:

    from numpy import dot
    from numpy.linalg import norm

    def cosine(vector1, vector2):
        """
        Related documents j and q are compared in the concept space
        using: cosine = ( V1 * V2 ) / ( ||V1|| * ||V2|| )
        """
        return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
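For instance, the two example documents above share only the word "me", so their cosine similarity comes out fairly low:

    v1 = [0, 0, 1, 1, 1, 0]   # "this is me"
    v2 = [1, 0, 0, 0, 1, 1]   # "hello answer me"
    print(cosine(v1, v2))     # 1 / (sqrt(3) * sqrt(3)) = 0.333...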

Alternatively, scikit-learn's sklearn.metrics.pairwise.cosine_similarity can be used:

from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity expects 2D inputs, so wrap single vectors in a list
sim = cosine_similarity([x], [y])
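To reach the end result described in the question (sum the vectors of the words in each sentence, then compare consecutive sentences), note that summing the unit vectors of the individual words gives the same term-count vector as mapping the whole sentence at once (before any stop-word removal). A minimal sketch along those lines, using the example vocabulary and some made-up sentences, might look like this:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Example vocabulary from above and a few hypothetical sentences
    vector_keyword_index = {"hello": 0, "world": 1, "this": 2, "is": 3, "me": 4, "answer": 5}
    sentences = ["hello world", "hello answer me", "this is me"]

    def word_vector(word):
        # Unit vector: 1 in the word's dimension, 0 elsewhere
        v = np.zeros(len(vector_keyword_index))
        v[vector_keyword_index[word]] = 1
        return v

    # Sum the word vectors of each sentence to get one vector per sentence
    sentence_vectors = [sum(word_vector(w) for w in s.split()) for s in sentences]

    # Compare each pair of consecutive sentences
    for a, b in zip(sentence_vectors, sentence_vectors[1:]):
        print(cosine_similarity([a], [b])[0][0])  # ~0.408, then ~0.333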
