Keep only certain bigrams in a document-term matrix in R

Question

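Question: how can I keep only the bigram "no wonderful" in the document-term matrix, or more generally only a list of bigrams (terms) that I choose?

I want to apply this to a very large document-term matrix. I tried converting it to an ordinary matrix, but the resulting vector would exceed 1000 Gb.

Code: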

    dd <- data.frame(
      id = 10:13,
      text = c("No wonderful, then, that ever",
               "So that in many cases such a ",
               "But there were still other and",
               "Not even at the rationale"),
      stringsAsFactors = FALSE)

    library(tm)
    library(RWeka)

    # map data frame columns to document content and id
    # (readTabular() comes from older tm releases)
    myReader <- readTabular(mapping = list(content = "text", id = "id"))

    # create a VCorpus from the data frame
    tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

    # bigram tokenizer (min = max = 2)
    Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    # create a term-document matrix (terms in rows, documents in columns)
    # using the tokenizer; note that the variable is nevertheless named dtm
    dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
    inspect(dtm)

Output:

                             Docs
            Terms           10 11 12 13
            at the          0  0  0  1
            but there       0  0  1  0
            cases such      0  1  0  0
            even at         0  0  0  1
            in many         0  1  0  0
            many cases      0  1  0  0
            no wonderful    1  0  0  0
            not even        0  0  0  1
            other and       0  0  1  0
            so that         0  1  0  0
            still other     0  0  1  0
            such a          0  1  0  0
            that ever       1  0  0  0
            that in         0  1  0  0
            the rationale   0  0  0  1
            then that       1  0  0  0
            there were      0  0  1  0
            were still      0  0  1  0
            wonderful then  1  0  0  0

Answer 1

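I had been treating it as a DTM, which made this seem more complicated than it is. The object is actually a TermDocumentMatrix, so the terms are the rows and can be selected by name.

Problem solved: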

    d_sel <- dtm[c('no wonderful', 'there were'),]
    inspect(d_sel)

                Docs
                Terms          10 11 12 13
                no wonderful    1  0  0  0
                there were      0  0  1  0
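
For a longer list of bigrams, some of which may not occur in the matrix at all, the same row subsetting can be combined with `intersect()`. The following is only a minimal sketch under a few assumptions: the two sample texts, the names `wanted` and `keep`, and the pure-R `bigram_tokenizer` (built on `ngrams()` and `words()`, used here in place of RWeka) are illustrative and not part of the original post. The subset remains a sparse term-document matrix, so no dense `as.matrix()` conversion is involved.

```r
library(tm)

# Two short sample documents (illustrative, already lower case, no punctuation)
docs <- c("no wonderful then that ever",
          "but there were still other and")
corpus <- VCorpus(VectorSource(docs))

# Pure-R bigram tokenizer (assumption: stands in for the RWeka tokenizer)
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Bigrams we would like to keep; the last one does not occur in the corpus
wanted <- c("no wonderful", "there were", "not present")

# Keep only the wanted bigrams that actually exist as terms,
# so the subsetting never asks for a missing row name
keep <- intersect(wanted, Terms(tdm))

# Terms are rows in a TermDocumentMatrix; the result stays sparse
tdm_sel <- tdm[keep, ]
inspect(tdm_sel)
```

If the object were a real DocumentTermMatrix, the terms would be columns instead, and the selection would read `dtm[, keep]`.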

Solution 1
0 2017-02-09 18:19:15