Keep only certain bigrams in a document-term matrix in R

Question

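Question: How can I keep only the bigram "no wonderful" in the document-term matrix, or only the bigrams (terms) from a list I want to keep?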

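I want to apply this to a very large document-term matrix. I tried converting the term matrix to a regular matrix, but the resulting vector size exceeds 1000 GB.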

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

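# Note: readTabular() comes from older tm releases; from tm 0.7 on,
# DataframeSource expects doc_id and text columns.
myReader <- readTabular(mapping = list(content = "text", id = "id"))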

#create v corpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

#n-gram tokenizer
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

#create term-document matrix using Tokenizer (terms are rows, documents are columns)
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                             Docs
            Terms           10 11 12 13
            at the          0  0  0  1
            but there       0  0  1  0
            cases such      0  1  0  0
            even at         0  0  0  1
            in many         0  1  0  0
            many cases      0  1  0  0
            no wonderful    1  0  0  0
            not even        0  0  0  1
            other and       0  0  1  0
            so that         0  1  0  0
            still other     0  0  1  0
            such a          0  1  0  0
            that ever       1  0  0  0
            that in         0  1  0  0
            the rationale   0  0  0  1
            then that       1  0  0  0
            there were      0  0  1  0
            were still      0  0  1  0
            wonderful then  1  0  0  0

Answer 1

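I was overcomplicating this because I thought it was a DTM. It is actually a TermDocumentMatrix, so the terms are rows and can be selected by row name.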

Problem solved:

    d_sel <- dtm[c('no wonderful', 'there were'),]
    inspect(d_sel)

                Docs
                Terms          10 11 12 13
                no wonderful    1  0  0  0
                there were      0  0  1  0

Solution 1
0 2017-02-09 18:19:15
