Cosine similarity of two term-frequency vectors in R

Question

Using tm in R, I built a DocumentTermMatrix (dtm). If I understand correctly, this matrix shows, for each document, how often each possible term occurs. When I inspect the matrix, I get:

    Terms
Docs     can design door easy finish include light provide use water
  176004   1      2   11    8      0       3     3       4   4     4
  181288   1      2   11    8      0       2     3       4   4     4
  182465   4      4    0    2      0       0    42      13   6     0
etc.

How can I now retrieve the vector for, say, document 181288, so that I get something like:

1      2   11    8      0       2     3       4   4     4 ………

Also, it says my dtm's sparsity is 100%. Does that mean it is (approximately) 100% empty?

Answer 1

You can retrieve your vector in several ways.

Simple, but not recommended except for a quick test:

my_doc <- inspect(dtm[dtm$dimnames$Docs == "181288",])

Doing it this way limits you to what inspect does, and inspect only shows a maximum of 10 documents.

A better way is to create a selection list, if you want one, and filter the dtm with it. This keeps the sparse matrix format; you can then transform what you need into a data.frame for further manipulation.

my_selection <- c("181288", "182465")

# selection in case of dtm
my_dtm_selection <- dtm[dtm$dimnames$Docs %in% my_selection, ]

# selection in case of tdm
my_tdm_selection <- tdm[, tdm$dimnames$Docs %in% my_selection]

# create data.frame with document names as first column, followed by the terms
my_df_selection <- data.frame(docs = Docs(my_dtm_selection), as.matrix(my_dtm_selection))
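Once the selection is a dense matrix, a single document's vector is just a row lookup by name, and two such vectors can be compared directly. Below is a minimal sketch in base R, assuming `m` stands in for `as.matrix(dtm)`; the values are copied from the three rows shown in the question, and `cosine_similarity` is a hypothetical helper written for this example, not a tm function.

```r
# Toy stand-in for as.matrix(dtm), built from the rows shown in the question.
terms <- c("can", "design", "door", "easy", "finish",
           "include", "light", "provide", "use", "water")
m <- rbind(
  "176004" = c(1, 2, 11, 8, 0, 3, 3, 4, 4, 4),
  "181288" = c(1, 2, 11, 8, 0, 2, 3, 4, 4, 4),
  "182465" = c(4, 4, 0, 2, 0, 0, 42, 13, 6, 0)
)
colnames(m) <- terms

# Indexing a row by document name returns a named numeric vector.
doc_a <- m["181288", ]
doc_b <- m["176004", ]

# Hypothetical helper: cosine similarity of two equal-length numeric vectors.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim <- cosine_similarity(doc_a, doc_b)
print(doc_a)
print(sim)
```

The two documents differ only in the count for "include", so the similarity comes out very close to 1. Converting a large dtm with `as.matrix` can use a lot of memory, so it is probably better to filter the dtm first and convert only the rows you need.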

The answer to your second question: yes, almost empty. Or, better put, it has a lot of empty cells. But if you have a lot of documents and terms, you might still have more data than you think.
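To make the sparsity figure concrete, here is a small base R sketch that computes it as the share of zero cells. The matrix size and the number of non-zero cells are made up for illustration, and the sketch assumes the percentage shown for a dtm is rounded to a whole number, which would explain how a matrix with real data can still report 100%.

```r
# Hypothetical toy counts: 200 documents x 500 terms, only 300 non-zero cells.
set.seed(1)
n_docs <- 200
n_terms <- 500
m <- matrix(0, nrow = n_docs, ncol = n_terms)
m[sample(length(m), 300)] <- 1  # sample() without replacement: 300 distinct cells

# Sparsity is the share of cells that are zero.
nonzero <- sum(m != 0)
sparsity <- 1 - nonzero / length(m)

print(nonzero)                # number of cells that actually hold data
print(sparsity * 100)         # exact percentage of empty cells
print(round(sparsity * 100))  # rounded to a whole percent
```

Here 300 of the 100,000 cells hold data, so the matrix is 99.7% empty and the rounded figure reads 100.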
