Cosine Similarity with two Term Frequency vectors in R
Using tm in R, I made a DocumentTermMatrix (dtm). If I understand correctly, this matrix shows, for each document, how often each possible term occurs. When I inspect this matrix, I get:
        Terms
Docs     can design door easy finish include light provide use water
  176004   1      2   11    8      0       3     3       4   4     4
  181288   1      2   11    8      0       2     3       4   4     4
  182465   4      4    0    2      0       0    42      13   6     0
etc.
How can I now retrieve the vector for (for example) document 181288, so that I get something like this?
1 2 11 8 0 2 3 4 4 4 ………
Also, it says my dtm's sparsity is 100%. Does that mean it is (approximately) 100% empty?
You can retrieve your vector in multiple ways.
The simple way, which I don't recommend except for a quick test:
my_doc <- inspect(dtm[dtm$dimnames$Docs == "181288",])
Doing it like this limits you to what inspect does, and inspect only shows a maximum of 10 documents.
A better way is to create a selection list, if you want one, and filter the dtm with it. This keeps the sparse matrix format; you can then transform what you need into a data.frame for further manipulation.
my_selection <- c("181288", "182465")
# selection in case of dtm
my_dtm_selection <- dtm[dtm$dimnames$Docs %in% my_selection, ]
# selection in case of tdm
my_tdm_selection <- tdm[, tdm$dimnames$Docs %in% my_selection]
# create data.frame with document names as first column, followed by the terms
my_df_selection <- data.frame(docs = Docs(my_dtm_selection), as.matrix(my_dtm_selection))
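Once a document's counts are available as a plain numeric vector, the comparison named in the title is only a few lines. The following is a minimal sketch in base R, not part of tm: the matrix m is typed in by hand from the counts shown in the question as a stand-in for as.matrix(dtm), and cosine_similarity is a hypothetical helper name chosen for illustration.

```r
# Hand-built stand-in for as.matrix(dtm), using the counts from the question.
docs  <- c("176004", "181288", "182465")
terms <- c("can", "design", "door", "easy", "finish",
           "include", "light", "provide", "use", "water")
m <- rbind(
  c(1, 2, 11, 8, 0, 3,  3,  4, 4, 4),
  c(1, 2, 11, 8, 0, 2,  3,  4, 4, 4),
  c(4, 4,  0, 2, 0, 0, 42, 13, 6, 0)
)
dimnames(m) <- list(Docs = docs, Terms = terms)

# Term-frequency vector of one document, selected by its row name.
doc_vec <- m["181288", ]

# Hypothetical helper: dot product divided by the product of the Euclidean norms.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_close <- cosine_similarity(m["176004", ], m["181288", ])
sim_far   <- cosine_similarity(m["176004", ], m["182465", ])

print(doc_vec)
print(c(sim_close = sim_close, sim_far = sim_far))
```

Documents 176004 and 181288 differ in a single count, so their similarity comes out close to 1, while 182465 scores much lower against either of them. For a real dtm, converting only the selected rows with as.matrix keeps the dense copy small.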
The answer to your second question: yes, almost empty. Or, better framed, it has a lot of empty cells. But if you have a lot of documents and terms, you might have more data than you think.
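To make the sparsity point concrete, here is a rough sketch in base R. It assumes sparsity means the share of zero cells and that the printed figure is rounded to a whole percent, which is how tm's output appears to behave. The small matrix repeats the counts from the question; the dimensions and non-zero count of the larger case are invented purely for illustration.

```r
# Counts from the question, as a small dense matrix.
m <- rbind(
  c(1, 2, 11, 8, 0, 3,  3,  4, 4, 4),
  c(1, 2, 11, 8, 0, 2,  3,  4, 4, 4),
  c(4, 4,  0, 2, 0, 0, 42, 13, 6, 0)
)

# Sparsity as the fraction of cells that are zero.
sample_sparsity <- mean(m == 0)

# Hypothetical larger corpus: the figures below are made up for illustration.
n_docs        <- 2000
n_terms       <- 5000
nonzero_cells <- 40000
big_sparsity  <- 1 - nonzero_cells / (n_docs * n_terms)

# Rounded to a whole percent, the large case shows as 100
# even though tens of thousands of cells hold counts.
print(round(sample_sparsity * 100))
print(round(big_sparsity * 100))
print(nonzero_cells)
```

The three-document excerpt is only 20% sparse because it shows ten frequent terms; across a full vocabulary most cells are zero, and a rounded 100% can still leave plenty of non-zero entries to work with.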