
Calculate cosine similarity of two words in R?

I have a text file and would like to create semantic vectors for each word in the file. I would then like to extract the cosine similarity for about 500 pairs of words. What is the best package in R for doing this?

You can use the lsa library. Its cosine function takes a matrix as input and returns a matrix of cosine similarities.
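A minimal sketch of that approach (assumes the lsa package is installed; the toy vectors here stand in for real semantic vectors, which would come from a term-document matrix or an embedding model):

```r
library(lsa)

# Hypothetical toy "semantic vectors" for two words
vec_king  <- c(0.5, 0.8, 0.1)
vec_queen <- c(0.4, 0.9, 0.2)

# cosine() accepts two vectors and returns their cosine similarity;
# given a single matrix, it compares all pairs of columns instead.
cosine(vec_king, vec_queen)
```

For the 500 word pairs, you would put the word vectors as columns of one matrix and call `cosine()` on it once, then look up the pairs you need in the resulting similarity matrix.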

If I understand your problem correctly, you want the cosine similarity of two vectors of words. Let us start with the cosine similarity of two words only:

library(stringdist)
d <- stringdist("ca","abc",method="cosine")

The result is d = 0.1835034, as expected.
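That value can be reproduced by hand in base R: stringdist's "cosine" method compares q-gram count profiles (with the default q = 1, simply character counts), and the distance is one minus the cosine of those count vectors.

```r
# Character counts over the union of characters {a, b, c}
x <- c(a = 1, b = 0, c = 1)  # counts in "ca"
y <- c(a = 1, b = 1, c = 1)  # counts in "abc"

# Cosine distance = 1 - cosine similarity of the count profiles
d <- 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
d  # 0.1835034, matching stringdist("ca", "abc", method = "cosine")
```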

There is also a function stringdistmatrix() in that package which calculates the distance between all pairs of strings:

> d <- stringdistmatrix(c('foo','bar','boo','baz'))
> d
  1 2 3
2 3    
3 1 2  
4 3 1 2
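Note that stringdistmatrix() defaults to method = "osa" (optimal string alignment), which is why the matrix above contains integer edit distances rather than cosine distances. To get pairwise cosine distances, pass the method explicitly (a sketch, assuming the stringdist package is installed):

```r
library(stringdist)

# Pairwise cosine distances for all strings, returned as a "dist"
# object holding the lower triangle of the distance matrix
d <- stringdistmatrix(c("foo", "bar", "boo", "baz"), method = "cosine")
d
```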

For your purpose, you can simply use something like this:

stringdist(c("ca","abc"),c("aa","abc"),method="cosine")

The results are the distances between ca and aa on the one hand, and between abc and abc on the other:

0.2928932 0.0000000

Disclaimer: The library stringdist is brand new (June 2019), but seems to work nicely. I am not associated with the authors of the library.
