
Keep only certain bigrams in document term matrix R

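Problem: how can I keep only one bigram (for example "no wonderful"), or a list of bigrams (terms) that I choose, in the document term matrix?

I want to apply this to a very large document term matrix. I tried converting the term matrix to a regular matrix, but the required vector size exceeds 1000 Gb.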
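To see why the dense conversion fails, it helps to estimate how much memory a call such as as.matrix() would need. The following is only a rough sketch: it assumes every cell is stored as an 8-byte double, and the dimensions are hypothetical, not taken from the question.

```r
# Approximate memory (in GiB) of a dense terms x documents matrix,
# assuming 8 bytes per cell (R's double type).
dense_size_gib <- function(n_terms, n_docs, bytes_per_cell = 8) {
  n_terms * n_docs * bytes_per_cell / 1024^3
}

# Hypothetical dimensions: one million bigrams across 150,000 documents.
size <- dense_size_gib(1e6, 150000)
print(round(size))
```

With numbers of that order the dense matrix needs more than 1000 GiB, even though almost every cell is zero. The term matrix built by tm is stored sparsely and keeps only the non-zero counts, so working on it directly avoids this allocation.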

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

# Map the data frame columns to document content and id.
# Note: readTabular() comes from older tm releases and may not be
# available in current versions.
myReader <- readTabular(mapping = list(content = "text", id = "id"))

# Create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# Bigram tokenizer
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Create a term-document matrix (terms in rows) using the tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                 Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0

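I had assumed the object was a DocumentTermMatrix (DTM), which made this seem more complicated than it is. It is actually a TermDocumentMatrix, so the terms are rows and can be selected by row name.

Problem solved: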

d_sel <- dtm[c('no wonderful', 'there were'), ]
inspect(d_sel)

               Docs
Terms          10 11 12 13
  no wonderful  1  0  0  0
  there were    0  0  1  0
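For a longer keep-list, the same row selection can be guarded against bigrams that never occur, and the list can also be passed at construction time so the full matrix is never built. The sketch below is a hedged illustration rather than the original poster's code: it builds the corpus with VectorSource, uses a pure-R bigram tokenizer (based on ngrams() and words() from NLP) as a stand-in for RWeka's NGramTokenizer, strips punctuation from the sample texts up front, and the `keep` vector, including the term "not present", is hypothetical.

```r
library(tm)  # attaches NLP as well, which provides words() and ngrams()

# Sample texts from the question, with punctuation removed up front.
texts <- c("No wonderful then that ever",
           "So that in many cases such a",
           "But there were still other and",
           "Not even at the rationale")
corpus <- VCorpus(VectorSource(texts))

# Pure-R bigram tokenizer (assumption: a stand-in for NGramTokenizer).
bigram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Hypothetical keep-list; "not present" does not occur in the corpus.
keep <- c("no wonderful", "there were", "not present")

# Index only by terms that really exist in the matrix.
present <- intersect(keep, Terms(tdm))

# Terms are rows in a TermDocumentMatrix; the result stays sparse.
tdm_sel <- tdm[present, ]

# Terms are columns in a DocumentTermMatrix, so the index moves.
dtm_sel <- t(tdm)[, present]

# Alternative: restrict the vocabulary while the matrix is being built.
tdm_dict <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = bigram_tokenizer, dictionary = present))

inspect(tdm_sel)
```

Neither selection calls as.matrix() on the full object, which matters for the large matrix described in the question. The `dictionary` route is mainly useful when the list of bigrams is known before the matrix is created.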

