
How can I calculate cosine similarity between two string vectors?

I have two vectors, each of length 6, and I want a number between 0 and 1.

a=c("HDa","2Pb","2","BxU","BuQ","Bve")

b=c("HCK","2Pb","2","09","F","G")

Can anyone explain what I should do?

Use the lsa package, following that package's manual:

# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))

# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)

Edit: here is what the myMatrix object looks like.

myMatrix
#myMatrix
#       docs
#  terms D1 D2
#    2    1  1
#    2pb  1  1
#    buq  1  0
#    bve  1  0
#    bxu  1  0
#    hda  1  0
#    09   0  1
#    f    0  1
#    g    0  1
#    hck  0  1

# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
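
As a check on the 0.3333 above, the arithmetic can be written out from the two columns of myMatrix. The documents share two terms (2 and 2pb), and each column contains six ones, so each has norm √6. The symbols D_1 and D_2 simply stand for those two columns:

```latex
\cos(D_1, D_2)
  = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}
  = \frac{2}{\sqrt{6}\,\sqrt{6}}
  = \frac{1}{3} \approx 0.3333
```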

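You first need a dictionary of all possible terms, and then you convert each vector to a binary vector with 1 in the positions of the terms it contains and 0 elsewhere. If you name the new vectors a2 and b2, you can compute a cosine-like score with cor(a2, b2). Note that cor() returns the Pearson correlation, which is the cosine of the mean-centered vectors and lies between -1 and 1. You can map it to [0,1] with something like 0.5*cor(a2, b2) + 0.5. The plain (uncentered) cosine of two binary vectors already lies between 0 and 1.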
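
The dictionary-and-binary-vector idea can be sketched as follows. This is an illustration in plain Python rather than R, and the helper names cosine and pearson are made up for this example. The pearson helper mirrors what R's cor() computes by default (Pearson correlation), so the sketch also shows how that differs from the uncentered cosine on the question's data.

```python
import math

a = ["HDa", "2Pb", "2", "BxU", "BuQ", "Bve"]
b = ["HCK", "2Pb", "2", "09", "F", "G"]

# dictionary of all possible terms, in a fixed order
vocab = sorted(set(a) | set(b))

# binary vectors: 1 where the term occurs, 0 elsewhere
a2 = [1 if term in a else 0 for term in vocab]
b2 = [1 if term in b else 0 for term in vocab]


def cosine(x, y):
    """Plain cosine similarity: dot product over the product of norms."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, i.e. the cosine of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([p - mean_x for p in x], [q - mean_y for q in y])


cos_ab = cosine(a2, b2)      # uncentered cosine, in [0, 1] for binary vectors
cor_ab = pearson(a2, b2)     # Pearson correlation, in [-1, 1]
mapped = 0.5 * cor_ab + 0.5  # the [0, 1] mapping of the correlation

print(cos_ab, cor_ab, mapped)
```

On these two vectors the uncentered cosine is 1/3, while the correlation is -2/3 and its mapped value is 1/6, so the two scores are not interchangeable.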

library(tm)
library(lsa)

CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
# keep one-character terms and use raw term frequencies
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf), weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
# rows are documents, columns are terms
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
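
For readers who want to see what the document-term approach amounts to, here is a rough sketch in plain Python. It only approximates the control settings above (lowercasing, punctuation removal, raw term counts) and is not a reimplementation of tm's tokenizer. The function names term_counts and cosine_counts are invented for this example.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase, strip punctuation, split on whitespace, count terms."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return Counter(cleaned.split())


def cosine_counts(c1, c2):
    """Cosine similarity of two sparse term-count vectors."""
    # only terms present in both documents contribute to the dot product
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm1 * norm2)


docs = ["Hi Hello", "Hello"]
similarity = cosine_counts(term_counts(docs[0]), term_counts(docs[1]))
print(round(similarity, 4))  # 0.7071
```

The two documents share one term out of two, so the score is 1/√2. Storing counts sparsely like this avoids building a full matrix, which matters once the vocabulary gets large.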

This may be the better choice for larger data sets.

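More advanced forms of embedding might help you get better output. Please check the following code. It uses the Universal Sentence Encoder, a model that generates sentence embeddings with a transformer-based architecture.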

import numpy as np
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(text):
    # the model takes a list of strings and returns one embedding per string
    return model([text])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]

# inner product of the two sentence embeddings
print(np.inner(embed(paragraph[0]), embed(paragraph[1])))
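
One caveat on the last line: an inner product equals cosine similarity only when both vectors have unit length. The answer does not say whether the model's embeddings are normalized, so a cautious pattern is to normalize explicitly before taking the inner product. A minimal sketch in plain Python, using made-up 3-dimensional vectors in place of real embeddings (the helper names are hypothetical):

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


# made-up vectors standing in for model embeddings (assumption, not model output)
e1 = [3.0, 4.0, 0.0]
e2 = [6.0, 8.0, 0.0]

raw = inner(e1, e2)                        # grows with the vector lengths
cos = inner(normalize(e1), normalize(e2))  # cosine similarity

print(raw, cos)
```

Here the two vectors point in the same direction, so the cosine is 1 even though the raw inner product is 50. Cosine similarity of embeddings can also be negative, so it may need rescaling if a value strictly between 0 and 1 is required.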

