DocumentTermMatrix with 0% sparsity

Question

I'm trying to build a document-term matrix from a book in Italian. I have the PDF file of the book and I wrote a few lines of code:

#install.packages("pdftools")
library(pdftools)
library(tm)
text <- pdf_text("IoRobot.pdf")
# collapse pdf pages into 1
text <- paste(unlist(text), collapse ="")
myCorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(myCorpus, control = list(removeNumbers = TRUE, removePunctuation = TRUE,
                                  stopwords = stopwords("it"), stemming = TRUE))
inspect(mydtm)

The result I obtained from the last line is:

<<DocumentTermMatrix (documents: 1, terms: 10197)>>
Non-/sparse entries: 10197/0
Sparsity           : 0%
Maximal term length: 39
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs calvin cosa donovan esser piú poi powel prima quando robot
   1    201  191     254   193 288 211   287   166    184   62

I noticed that the sparsity is 0%. Is this normal?

Answer 1

Yes, that is correct.
A document-term matrix has one row per document and one column per term. With the default term-frequency weighting, each cell holds how many times the term occurs in that document, so a cell is 0 when the term does not occur there.
Sparsity is an indicator of the share of 0s in the document-term matrix.
A sparse entry is one where the term does not occur in the document.
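
To make the definition concrete, here is a minimal base-R sketch that builds a tiny term-count matrix by hand and computes the share of zero cells. The two toy documents and all variable names are made up for illustration, and this is not how tm builds the matrix internally.

```r
# Two toy documents (hypothetical text, not taken from the question)
docs <- c("robot robot law", "law first")

# Split each document into words and collect the sorted vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document (one column per document)
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)

# Transpose so that rows are documents and columns are terms
dtm <- t(counts)
dimnames(dtm) <- list(Docs = as.character(seq_along(docs)), Terms = terms)
dtm

# Sparsity = zero cells divided by all cells
sparsity <- sum(dtm == 0) / length(dtm)
sparsity  # 2 zero cells out of 6
```

Note that "robot" gets a 2 in the first row: the cells are counts, not just presence flags, and only the cells equal to 0 count towards sparsity.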

To make this concrete, let's look at a reproducible example that creates a situation similar to yours:

library(tm)
text <- c("here some text")
corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)
DTM

<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 4
Weighting          : term frequency (tf)

Looking at the output, we can see you have one document (so a DTM built from that corpus has a single row).
Having a look at it:

as.matrix(DTM)
    Terms
Docs here some text
   1    1    1    1

Now the output is easier to understand:

You have one document with three terms:

<<DocumentTermMatrix (documents: 1, terms: 3)>>

Your non-sparse entries (i.e. != 0 in the DTM) are 3, and your sparse entries (== 0) are 0:

Non-/sparse entries: 3/0

So your sparsity is 0%, because a one-document corpus cannot contain any 0s: the term list is built from that single document, so every term occurs in it at least once and every cell is non-zero:

  Sparsity           : 0%
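
The same reasoning applies to the book in the question: collapsing all pages into one string yields a single document, hence no zeros. Below is a minimal sketch of what happens if the pages are kept as separate documents instead. The `pages` vector is a hypothetical stand-in for the result of pdf_text(), which returns one string per page; the names `pageCorpus` and `pageDTM` are also just illustrative.

```r
library(tm)

# Hypothetical stand-in for pdf_text("IoRobot.pdf"): one string per page
pages <- c("robot first law", "robot second law", "third law")

# No paste(..., collapse = ""): each page stays a separate document
pageCorpus <- VCorpus(VectorSource(pages))
pageDTM <- DocumentTermMatrix(pageCorpus)

nDocs(pageDTM)    # one row per page
inspect(pageDTM)  # not every term is on every page, so some cells are 0
```

Whether per-page documents make sense depends on the analysis; for a single-document matrix, 0% sparsity is simply the expected value.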

Now a different example, one that does have sparse entries:

text <- c("here some text", "other text")

corpus <- VCorpus(VectorSource(text))
DTM <- DocumentTermMatrix(corpus)

DTM
<<DocumentTermMatrix (documents: 2, terms: 4)>>
Non-/sparse entries: 5/3
Sparsity           : 38%
Maximal term length: 5
Weighting          : term frequency (tf)

as.matrix(DTM)
    Terms
Docs here other some text
   1    1     0    1    1
   2    0     1    0    1

Now you have 3 sparse entries and 5 non-sparse ones (the 5/3 above), out of 2 × 4 = 8 cells in total, and 3/8 = 0.375, i.e. the 38% sparsity reported.
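
The arithmetic can be checked directly against the dense matrix. This is a small sketch that rebuilds the two-document example and counts the zero cells; the names `m`, `zero_cells` and `total_cells` are only illustrative.

```r
library(tm)

# Same two-document corpus as in the example
corpus <- VCorpus(VectorSource(c("here some text", "other text")))
DTM <- DocumentTermMatrix(corpus)

# Convert to an ordinary matrix so the cells can be counted
m <- as.matrix(DTM)

zero_cells  <- sum(m == 0)  # sparse entries
total_cells <- length(m)    # documents x terms

zero_cells / total_cells    # share of zero cells, printed by tm as a rounded percentage
```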
