Cosine Similarity Matrix in R
I have a document-term matrix, "mydtm", that I created in R using the 'tm' package. I am trying to depict the similarities between each of the 557 documents contained in the dtm/corpus. I have been attempting to build a cosine similarity matrix using:
mydtm_cosine <- dist(mydtm_matrix, method = "cosine", diag = F, upper = F)
However, the output matrix I get is huge, with many missing values. Any help/suggestions would be much appreciated.
[Image: Output Matrix]
Your documents probably share very few words with one another. You may wish to reduce the term-document matrix to the terms that occur in a larger share of the documents.
library(tm)

# Four toy documents
text <- c("term-document matrix is a mathematical matrix",
          "we now have a tidy three-column",
          "cast into a Term-Document Matrix",
          "where the rows represent the text responses, or documents")
corpus <- VCorpus(VectorSource(text))
# wordLengths = c(1, Inf) keeps one- and two-letter terms such as "a"
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
# Share of documents in which each term occurs at least once
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))
occurrence
#             a          cast     documents          have
#          0.75          0.25          0.25          0.25
#          into            is  mathematical        matrix
#          0.25          0.25          0.25          0.50
#           now            or     represent    responses,
#          0.25          0.25          0.25          0.25
#          rows term-document          text           the
#          0.25          0.50          0.25          0.25
#  three-column          tidy            we         where
#          0.25          0.25          0.25          0.25
quantile(occurrence, probs = c(0.5, 0.9, 0.99))
#    50%    90%    99%
# 0.2500 0.5000 0.7025
# Keep only the terms that occur in at least half of the documents
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
tdm_mat
#                Docs
# Terms           1 2 3 4
#   a             1 1 1 0
#   matrix        2 0 1 0
#   term-document 1 0 1 0
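You can then calculate the cosine distance with the proxy package. Note that dist() with method = "cosine" returns 1 minus the cosine similarity, and that it compares the rows of the matrix, which here are terms rather than documents.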
library(proxy)
dist(tdm_mat, method = "cosine", upper = TRUE)
#                       a    matrix term-document
# a                       0.2254033     0.1835034
# matrix        0.2254033               0.0513167
# term-document 0.1835034 0.0513167
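Since the question asks about similarity between documents, here is a minimal base-R sketch of that comparison. It assumes terms are in rows and documents in columns, as in tdm_mat; the small matrix is retyped from the tdm_mat output above so the snippet runs on its own. cosine_sim_cols is a hypothetical helper written for this illustration, not a function from tm or proxy. It also shows one possible source of missing values: a document that contains none of the retained terms has zero length, so its cosine similarity is undefined (NaN).

```r
# Toy counts retyped from the tdm_mat output above
# (terms in rows, documents in columns)
tdm_mat <- matrix(c(1, 1, 1, 0,
                    2, 0, 1, 0,
                    1, 0, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("a", "matrix", "term-document"),
                                  Docs = as.character(1:4)))

# Hypothetical helper (not from tm or proxy):
# cosine similarity between the columns of a numeric matrix
cosine_sim_cols <- function(m) {
  norms <- sqrt(colSums(m^2))        # Euclidean length of each column
  crossprod(m) / outer(norms, norms) # dot products divided by |x| * |y|
}

doc_sim <- cosine_sim_cols(tdm_mat)
round(doc_sim, 3)

# Document 4 contains none of the retained terms, so its row and
# column are NaN (0 / 0). Dropping all-zero documents avoids this.
empty_docs <- colSums(tdm_mat) == 0
doc_sim_clean <- cosine_sim_cols(tdm_mat[, !empty_docs, drop = FALSE])
round(doc_sim_clean, 3)
```

If a distance rather than a similarity is needed (for clustering, say), 1 - doc_sim_clean gives the cosine distance between the remaining documents.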