
How can I calculate cosine similarity between two string vectors?

I have two vectors, each with 6 elements, and I would like a similarity score between 0 and 1.

a=c("HDa","2Pb","2","BxU","BuQ","Bve")

b=c("HCK","2Pb","2","09","F","G")

Can anyone explain what I should do?

Using the lsa package, following the manual for this package:

# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))

# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)

EDIT: this is what the myMatrix object looks like:

myMatrix
#myMatrix
#       docs
#  terms D1 D2
#    2    1  1
#    2pb  1  1
#    buq  1  0
#    bve  1  0
#    bxu  1  0
#    hda  1  0
#    09   0  1
#    f    0  1
#    g    0  1
#    hck  0  1

# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
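
To see where 0.3333 comes from, the value can be checked by hand from the matrix printed above: each document contains 6 terms, and the two documents share only "2" and "2pb". The symbols d_1 and d_2 below are just labels for the D1 and D2 columns.

```latex
\cos(d_1, d_2)
  = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
  = \frac{2}{\sqrt{6}\,\sqrt{6}}
  = \frac{1}{3} \approx 0.3333
```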

You need a dictionary of possible terms first and then convert your vectors to binary vectors with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors a2 and b2, you can calculate a similarity with cor(a2, b2). Note that cor() is the Pearson correlation, i.e. the cosine of the mean-centered vectors, so it lies between -1 and 1, whereas the plain cosine of two binary vectors is already between 0 and 1. You could map the correlation to [0,1] with something like this: 0.5*cor(a2, b2) + 0.5
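
The dictionary-and-binary-vector recipe can be sketched as follows. This is only an illustration, written in Python like the embedding answer on this page; the helper names to_binary_vectors and cosine are made up here, and it assumes neither vector is empty. In R the same steps map roughly onto union() and %in%.

```python
import math

def to_binary_vectors(a, b):
    """Build a shared vocabulary and encode each token list as a 0/1 vector."""
    set_a, set_b = set(a), set(b)
    vocab = sorted(set_a | set_b)  # the "dictionary" of possible terms
    a2 = [int(term in set_a) for term in vocab]
    b2 = [int(term in set_b) for term in vocab]
    return vocab, a2, b2

def cosine(u, v):
    """Plain cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

a = ["HDa", "2Pb", "2", "BxU", "BuQ", "Bve"]
b = ["HCK", "2Pb", "2", "09", "F", "G"]

vocab, a2, b2 = to_binary_vectors(a, b)
similarity = cosine(a2, b2)
print(round(similarity, 4))  # 0.3333
```

Because the binary vectors have no negative entries, the result is already between 0 and 1, so no rescaling is needed for this variant.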

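library(tm)   # VCorpus, VectorSource, DocumentTermMatrix, weightTf
library(lsa)  # cosine

CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
# Keep terms of any length and use raw term frequencies as weights
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# Rows are documents, so compare row 1 with row 2
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])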

This could be the better option for larger data sets.

A more advanced form of embedding might give you better results. Please check the following code. It uses the Universal Sentence Encoder, a model that generates sentence embeddings with a transformer-based architecture.

import numpy as np
import tensorflow as tf  # must be installed for tensorflow_hub to load the model
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(text):
    # The model takes a list of strings and returns one embedding per string
    return model([text])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]

# Inner product of the two sentence embeddings
print(np.inner(embed(paragraph[0]), embed(paragraph[1])))
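
One caveat about the last line: an inner product equals the cosine similarity only when both vectors have unit length. As far as I know the Universal Sentence Encoder embeddings are described as approximately normalized, so np.inner should be close to the cosine, but that is an assumption worth checking for your model. The sketch below uses two small stand-in vectors (not real embeddings) and hypothetical helper names to show the difference without downloading the model.

```python
import math

def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def normalize(u):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(inner(u, u))
    return [x / length for x in u]

# Stand-in vectors for illustration only; real embeddings have 512 dimensions
u = [3.0, 4.0]
v = [4.0, 3.0]

raw = inner(u, v)                        # inner product of the raw vectors
cos = inner(normalize(u), normalize(v))  # cosine similarity
print(raw, round(cos, 2))  # 24.0 0.96
```

With NumPy arrays the same normalization can be done by dividing each vector by np.linalg.norm of that vector before calling np.inner.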

