How to calculate similarity for pre-trained word embeddings
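I want to use pre-trained embedding vectors in R to find the words most similar to a given word, for example the words most similar to "beer". To do this, I downloaded pre-trained embeddings from http://nlp.stanford.edu/data/glove.twitter.27B.zip and applied the following code:
Source code: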
glove_dir <- "~/Downloads/glove.6B"
lines <- readLines(file.path(glove_dir, "glove.6B.100d.txt"))

# Map each word to its embedding vector.
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_along(lines)) {
  line <- lines[[i]]
  values <- strsplit(line, " ")[[1]]
  word <- values[[1]]
  embeddings_index[[word]] <- as.double(values[-1])
}
cat("Found", length(embeddings_index), "word vectors.\n")

# NOTE: max_words and word_index are not defined in this snippet; they must
# already exist (for example from a tokenizer) before this part runs.
embedding_dim <- 100
embedding_matrix <- array(0, c(max_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector))
      embedding_matrix[index + 1, ] <- embedding_vector
  }
}
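But I don't know how to get the most similar words. I found the example below, but it does not work because the structure of my embedding vectors is different: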
# Requires sim2() from text2vec and the pipe from magrittr.
library(text2vec)
library(magrittr)

find_similar_words <- function(word, embedding_matrix, n = 5) {
  similarities <- embedding_matrix[word, , drop = FALSE] %>%
    sim2(embedding_matrix, y = ., method = "cosine")
  similarities[, 1] %>% sort(decreasing = TRUE) %>% head(n)
}
find_similar_words("beer", embedding_matrix)
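The example indexes `embedding_matrix` by word (`embedding_matrix[word, ]`), so it presumably expects a matrix whose row names are the words, whereas the matrix built earlier is indexed by integer position. A minimal sketch of converting an environment like `embeddings_index` into such a matrix follows. The three-dimensional toy vectors are made up for illustration and only stand in for the real GloVe data, and the name `named_matrix` is an assumption.

```r
# Toy stand-in for embeddings_index (made-up 3-dimensional vectors).
embeddings_index <- new.env(hash = TRUE, parent = emptyenv())
embeddings_index[["beer"]]  <- c(0.9, 0.1, 0.0)
embeddings_index[["wine"]]  <- c(0.8, 0.2, 0.1)
embeddings_index[["nurse"]] <- c(0.0, 0.9, 0.4)

# Pull every vector out of the environment, in a fixed word order.
words <- sort(ls(embeddings_index))
vectors <- mget(words, envir = embeddings_index)

# Stack the vectors as rows and attach the words as row names.
# Assumes all vectors have the same length.
named_matrix <- matrix(
  unlist(vectors, use.names = FALSE),
  nrow = length(words),
  byrow = TRUE,
  dimnames = list(words, NULL)
)

# Lookup by word now works.
print(dim(named_matrix))
print(named_matrix["beer", , drop = FALSE])
```

With row names in place, a call such as `named_matrix["beer", , drop = FALSE]` returns a one-row matrix instead of raising a subscript error.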
How can I calculate the similarity of pre-trained word embeddings in R?
One solution might be to use the text package (www.r-text.org).
# for installation guidelines see: http://www.r-text.org/articles/Extended_Installation_Guide.html
library(text)
text_example <- c("beer wine nurse doctor")
text_example_embedding <- textEmbed(text_example, contexts = FALSE)
word_ss <- textSimilarityMatrix(text_example_embedding$singlewords_we)
# The order of the words has changed, so name them according to how they appear in the output.
colnames(word_ss) <- text_example_embedding$singlewords_we$words
rownames(word_ss) <- text_example_embedding$singlewords_we$words
round(word_ss, 3)
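Cosine similarity can also be computed directly in base R once the vectors sit in a matrix with the words as row names. The sketch below is only an illustration under stated assumptions and is not part of the text package: `nearest_words` is a hypothetical helper, and the toy vectors are invented stand-ins for real GloVe rows.

```r
# Toy embedding matrix: one row per word (made-up values, not real GloVe vectors).
emb <- rbind(
  beer   = c(0.9, 0.1, 0.0),
  wine   = c(0.8, 0.2, 0.1),
  nurse  = c(0.0, 0.9, 0.4),
  doctor = c(0.1, 0.8, 0.5)
)

# Hypothetical helper: the n words with the highest cosine similarity to `word`.
nearest_words <- function(word, emb, n = 5) {
  stopifnot(word %in% rownames(emb))
  # Scale each row to unit length so a dot product equals cosine similarity.
  # (Rows that are all zeros would yield NaN; sort() drops those by default.)
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- drop(unit %*% unit[word, ])
  sims <- sims[names(sims) != word]  # exclude the query word itself
  head(sort(sims, decreasing = TRUE), n)
}

print(nearest_words("beer", emb, n = 2))
```

For the full GloVe file, `emb` would be the complete word-by-dimension matrix with row names; normalizing it once and reusing the result avoids repeating that work for every query.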