Spark word2vec example explanation and how to get similarity between strings

I followed the word2vec example in the Spark documentation (link). It worked, but I didn't quite understand what it is trying to compute.

Are the output vectors representations of the strings?

If so, I tried to compute the cosine similarity between them, but I got negative values because the vector components are not all positive.

Can Spark word2vec create positive-only vectors?

How can I compute the similarity between a list of strings using Spark word2vec?

The output vector (obtained by calling transform on a dataset) is a representation of the document (a sentence or several sentences) supplied to the model. In essence, this output is a combination of the vector representations of each word in the given document (most likely a simple vector sum; the Spark ML documentation describes it as the average of the word vectors).
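To make the idea of combining word vectors concrete, here is a minimal sketch in plain Python rather than Spark. The `WORD_VECTORS` table and the `document_vector` helper are hypothetical names invented for illustration, and skipping out-of-vocabulary words before averaging is a choice made in this sketch; Spark's own handling may differ.

```python
from typing import Dict, List

# Hypothetical 3-dimensional word vectors; a trained model would supply real ones.
WORD_VECTORS: Dict[str, List[float]] = {
    "spark": [0.5, -0.25, 1.0],
    "is": [0.0, 0.5, -0.5],
    "fast": [1.0, 0.25, 0.0],
}


def document_vector(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words that appear in the vocabulary."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    known = [w for w in words if w in vectors]
    for word in known:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    if not known:
        return total  # no known words: fall back to the zero vector
    return [value / len(known) for value in total]


doc_vec = document_vector("spark is fast".split(), WORD_VECTORS)
print(doc_vec)
```

Note that individual components can be negative, both in the word vectors and in the averaged document vector.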

You can use findSynonyms to get the "num" words closest in similarity to a given word. findSynonyms is based on cosine similarity only. Currently I use the model to generate feature vectors, which serve as input to another model.
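The ranking idea behind a nearest-word lookup can be sketched without Spark. The following is not Spark's implementation; `nearest_words`, `cosine`, and the toy 2-dimensional vectors are assumptions made up for this illustration of "rank every other word by cosine similarity and keep the top num".

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy vectors, chosen only to make the ranking easy to follow.
WORD_VECTORS: Dict[str, List[float]] = {
    "king": [1.0, 0.0],
    "queen": [0.8, 0.6],
    "apple": [0.0, 1.0],
    "pear": [-0.6, 0.8],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word: str, vectors: Dict[str, List[float]], num: int) -> List[Tuple[str, float]]:
    """Return the `num` words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


closest = nearest_words("king", WORD_VECTORS, 2)
print(closest)
```

With these toy vectors, "pear" has a negative similarity to "king" and is therefore ranked last, which is why it does not appear among the top two.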

To compute the similarity between two strings as a single number, you would need to implement some variation of the findSynonyms method. The current implementation generates a cosVec corresponding to the input string and then tries to find the word vectors closest to that vector.
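One possible variation, sketched in plain Python under stated assumptions: represent each string by the average of its word vectors and take the cosine of the two averages. The names `average_vector` and `string_similarity` and the vector values are hypothetical; in practice the vectors would come from the trained model.

```python
import math
from typing import Dict, List, Optional

# Hypothetical 2-dimensional word vectors for illustration only.
WORD_VECTORS: Dict[str, List[float]] = {
    "good": [1.0, 0.5],
    "great": [0.8, 0.7],
    "movie": [0.2, 1.0],
    "film": [0.3, 0.9],
    "terrible": [-1.0, -0.4],
}


def average_vector(text: str, vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of known words; None if no word is in the vocabulary."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def string_similarity(a: str, b: str, vectors: Dict[str, List[float]]) -> Optional[float]:
    """Cosine similarity of the two averaged vectors, or None if undefined."""
    u, v = average_vector(a, vectors), average_vector(b, vectors)
    if u is None or v is None:
        return None
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else None


sim_close = string_similarity("good movie", "great film", WORD_VECTORS)
sim_far = string_similarity("good movie", "terrible", WORD_VECTORS)
print(sim_close, sim_far)
```

Here the second score comes out negative, which simply means the two averaged vectors point in roughly opposite directions.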

I am not sure whether it can create only positive vectors, or whether generating only positive vectors is required or even makes sense.
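On the negative values: cosine similarity lies in the range from -1 to 1 by definition, so negative scores are an expected outcome whenever vectors have components of mixed sign. If a non-negative score is needed downstream, one common option (an assumption here, not something this answer or Spark prescribes) is to rescale the cosine linearly instead of forcing the vectors themselves to be positive:

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \in [-1, 1],
\qquad
s(u, v) = \frac{1 + \cos(u, v)}{2} \in [0, 1]
```

The rescaled score s preserves the ordering of the cosine values, so rankings such as "closest words" are unchanged.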
