Calculate cosine similarity of two words in R?
I have a text file and would like to create semantic vectors for each word in the file. I would then like to extract the cosine similarity for about 500 pairs of words. What is the best package in R for doing this?
You can use the lsa library. The cosine function of that library gives a matrix of cosine similarities. It takes a matrix as input.
If I understand your problem correctly, you want the cosine similarity of two vectors of words. Let us start with the cosine similarity of two words only:
library(stringdist)
d <- stringdist("ca","abc",method="cosine")
The result is d = 0.1835034, as expected.
There is also a function stringdistmatrix() in that package, which calculates the distance between all pairs of strings:
> d <- stringdistmatrix(c('foo','bar','boo','baz'))
> d
  1 2 3
2 3
3 1 2
4 3 1 2
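If you need the result as an ordinary matrix rather than a dist object (for example, to look up specific pairs by index), you can convert it; the method argument works here too. A sketch, assuming the cosine q-gram distance is what you want:

```r
library(stringdist)

# Full symmetric matrix of cosine q-gram distances between all pairs.
d <- stringdistmatrix(c('foo','bar','boo','baz'), method = "cosine")
m <- as.matrix(d)

m[1, 2]  # distance between 'foo' and 'bar'
```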
For your purpose, you can simply use something like this:
stringdist(c("ca","abc"),c("aa","abc"),method="cosine")
The results are the distance measures between ca and aa on the one hand, and between abc and abc on the other:

0.2928932 0.0000000
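The first value can be checked by hand. The stringdist cosine method works on q-gram counts (q = 1 by default, i.e. single characters), so for "ca" and "aa":

```r
# q-gram counts with q = 1:
#   "ca" -> (a: 1, c: 1)
#   "aa" -> (a: 2)
dot   <- 1 * 2                       # only 'a' is shared
norms <- sqrt(1^2 + 1^2) * sqrt(2^2) # vector lengths of the count vectors
1 - dot / norms                      # 0.2928932, as reported above
```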
Disclaimer: The stringdist library is brand new (June 2019) but seems to work nicely. I am not associated with the authors of the library.