Including all tokens in a term-document matrix in the R tm package

Question

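I am trying to create a term-document matrix with the TermDocumentMatrix function from the tm package in R, and I found that some words are not included in it.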

> library(tm)
> tdm <- TermDocumentMatrix(Corpus(VectorSource("The book is of great importance.")))
> rownames(tdm)
[1] "book"        "great"       "importance." "the"

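Here, the words is and of have been excluded from the matrix. If the corpus contains only words that get removed, the following message is shown.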

> tdm <- TermDocumentMatrix(Corpus(VectorSource("of is of is")))
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
> rownames(tdm)
NULL

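The message signals that is and of were removed before the matrix was built, but I have not been able to figure out why this happens or how I can include every token in the corpus.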

Any help is appreciated.

Answer 1

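Use the control argument of TermDocumentMatrix. By default, terms shorter than three characters are discarded (wordLengths defaults to c(3, Inf)), which is why is and of disappear while the is kept. Lowering the minimum length keeps them: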

require(tm)
tdm <- TermDocumentMatrix(Corpus(VectorSource("of is of is")), control =  list(stopwords=FALSE, wordLengths=c(0, Inf)))
rownames(tdm)
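
To see the effect on the sentence from the question, the default call can be compared with one that relaxes the length filter. This is only a minimal sketch: the variable names (corpus, tdm_default, tdm_all) are made up for illustration, and the exact terms returned may differ between tm versions because tokenization has changed over time.

```r
library(tm)

# Same one-document corpus as in the question.
corpus <- Corpus(VectorSource("The book is of great importance."))

# Default control: wordLengths is c(3, Inf), so tokens shorter
# than three characters are dropped before the matrix is built.
tdm_default <- TermDocumentMatrix(corpus)

# Lower the minimum length to 1 so that short tokens are kept.
tdm_all <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

rownames(tdm_default)
rownames(tdm_all)

# Terms that only appear once the length filter is relaxed.
setdiff(rownames(tdm_all), rownames(tdm_default))
```

The setdiff call should list the short words that the default settings filtered out. Setting stopwords=FALSE, as in the answer above, additionally makes sure that no stopword list is applied.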
