How do I convert a data frame into a DocumentTermMatrix?

Question

I am trying to use tidytext to transform a tibble of word frequencies into a DocumentTermMatrix, but the function doesn't seem to work as expected. I start from AssociatedPress, which I know is a DocumentTermMatrix, tidy it and cast it back, but the output is not the same as the original matrix. What am I doing wrong?

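library(topicmodels)  # provides the AssociatedPress data set
library(tidytext)     # tidy() and cast_dtm()
library(dplyr)        # %>%

data(AssociatedPress)
ap_td <- tidy(AssociatedPress)
tt <- ap_td %>%
  cast_dtm(document, term, count)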

The element $dimnames$Docs is not NULL after I cast ap_td, but it was NULL in AssociatedPress. Output of str(tt), followed by str(AssociatedPress):

List of 6
 $ i       : int [1:302031] 1 16 35 72 84 93 101 111 155 161 ...
 $ j       : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2246
 $ ncol    : int 10473
 $ dimnames:List of 2
  ..$ Docs : chr [1:2246] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:10473] "adding" "adult" "ago" "alcohol" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

List of 6
 $ i       : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
 $ v       : num [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
 $ nrow    : int 2246
 $ ncol    : int 10473
 $ dimnames:List of 2
  ..$ Docs : NULL
  ..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

cast_dtm also raises a warning:

Warning message:
Trying to compute distinct() for variables not found in the data:
- `row_col`, `column_col`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.

On GitHub, I found this issue, which should have been fixed by now.

Answer 1

I don't get your warning message using tidytext 0.1.9.900 and R 3.5.0.
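To compare a local setup with the versions mentioned above, the installed versions can be printed. This is only a minimal sketch using base R utilities; dplyr is included on the assumption that it is relevant, because the distinct() named in the warning is a dplyr function.

```r
# Version of tidytext, which provides tidy() and cast_dtm()
packageVersion("tidytext")

# Version of dplyr, which provides distinct()
packageVersion("dplyr")

# Version of R itself
R.version.string
```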

The two dtms are identical in the number of terms, rows and columns. All the counts are correct as well.
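One way to check this claim is sketched below. Because the two matrices store their entries and terms in a different order, the sketch compares dimensions directly and compares counts through the tidied (document, term, count) triples. The helper as_keys is a hypothetical name introduced here for illustration; it is not part of tidytext.

```r
library(topicmodels)  # provides the AssociatedPress data set
library(tidytext)     # tidy() and cast_dtm()

data("AssociatedPress", package = "topicmodels")
ap_td <- tidy(AssociatedPress)
tt <- cast_dtm(ap_td, document, term, count)

# Same number of rows (documents) and columns (terms)?
same_dims <- all(c(tt$nrow, tt$ncol) ==
                 c(AssociatedPress$nrow, AssociatedPress$ncol))

# Hypothetical helper: turn a tidied dtm into a sorted vector of
# "document term count" strings so that entry order does not matter
as_keys <- function(td) sort(paste(td$document, td$term, td$count))

# Same non-zero counts for every document/term pair?
same_counts <- identical(as_keys(ap_td), as_keys(tidy(tt)))

same_dims
same_counts
```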

The difference is indeed between tt$dimnames$Docs and AssociatedPress$dimnames$Docs.

The reason is that when the dtm has no document ids before tidying, as is the case with AssociatedPress, the tidy function assigns AssociatedPress$i to the document variable of the tidy data frame (ap_td). Casting this back into a dtm fills $dimnames$Docs with the document values from ap_td. So in the end the AssociatedPress$i values end up in tt$dimnames$Docs.

You can see this by comparing the $i from AssociatedPress with the Docs from tt.

all.equal(unique(as.character(AssociatedPress$i)), unique(tt$dimnames$Docs))
[1] TRUE

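Or comparing step by step, from AssociatedPress to ap_td and from ap_td to tt (all.equal compares only two objects at a time, so the chain needs two calls):

all.equal(unique(as.character(AssociatedPress$i)), unique(as.character(ap_td$document)))
[1] TRUE
all.equal(unique(as.character(ap_td$document)), unique(tt$dimnames$Docs))
[1] TRUE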
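The hand-off of document ids described in this answer can also be traced one step at a time. The sketch below reuses the objects from the question; the variable names docs_before, doc_column and docs_after are introduced here only for illustration.

```r
library(topicmodels)  # provides the AssociatedPress data set
library(tidytext)     # tidy() and cast_dtm()

data("AssociatedPress", package = "topicmodels")

# Step 1: document names stored in the original matrix
docs_before <- AssociatedPress$dimnames$Docs

# Step 2: tidy() still has to fill the document column with something
ap_td <- tidy(AssociatedPress)
doc_column <- ap_td$document

# Step 3: casting back uses those document values as document names
tt <- cast_dtm(ap_td, document, term, count)
docs_after <- tt$dimnames$Docs

is.null(docs_before)  # is the original without document names?
head(doc_column)      # what tidy() put in the document column
head(docs_after)      # what cast_dtm() stored as document names
```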

If you want to follow the logic yourself, you can check all the functions used on the GitHub page for the sparse_tidiers. Start with tidy.DocumentTermMatrix and follow the function calls to tidy.simple_triplet_matrix and finally to tidy_triplet.
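As a rough illustration of the behaviour described in this answer, here is a simplified stand-in written in base R. It is not tidytext's actual source: the function tidy_triplet_sketch and its arguments are hypothetical, and it only mimics the fallback to the row index when no row names are available.

```r
# Hypothetical, simplified stand-in for the tidying step (NOT tidytext code).
# i, j, v are the row indices, column indices and values of a triplet matrix.
tidy_triplet_sketch <- function(i, j, v, row_names = NULL, col_names = NULL) {
  # Without row names, fall back to the integer row index
  row <- if (is.null(row_names)) i else row_names[i]
  # Without column names, fall back to the integer column index
  column <- if (is.null(col_names)) j else col_names[j]
  data.frame(row = row, column = column, value = v,
             stringsAsFactors = FALSE)
}

# Toy matrix with term names but no document names, like AssociatedPress
out <- tidy_triplet_sketch(
  i = c(1L, 1L, 2L),
  j = c(1L, 2L, 2L),
  v = c(2, 1, 3),
  col_names = c("apple", "pear")
)
out
```

In this toy case the row column holds the integer indices 1, 1, 2, which is the kind of value that later ends up as document names when the data is cast back.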

Answer 1 (accepted), 2018-09-08 09:02:40
