Keep only certain bigrams in document term matrix R

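Question: How do I keep only one bigram (e.g. "no wonderful"), or only a list of bigrams (terms) that I choose, in the document term matrix?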

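I want to apply this to a very large document term matrix. I tried converting the term matrix to a regular matrix, but the vector size exceeds 1000 Gb.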

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

myReader <- readTabular(mapping = list(content = "text", id = "id"))

#create v corpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

#n-gram tokenizer
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

#create term-document matrix (terms in rows, documents in columns) using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                             Docs
            Terms           10 11 12 13
            at the          0  0  0  1
            but there       0  0  1  0
            cases such      0  1  0  0
            even at         0  0  0  1
            in many         0  1  0  0
            many cases      0  1  0  0
            no wonderful    1  0  0  0
            not even        0  0  0  1
            other and       0  0  1  0
            so that         0  1  0  0
            still other     0  0  1  0
            such a          0  1  0  0
            that ever       1  0  0  0
            that in         0  1  0  0
            the rationale   0  0  0  1
            then that       1  0  0  0
            there were      0  0  1  0
            were still      0  0  1  0
            wonderful then  1  0  0  0

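At the time I thought the object was a DTM, which made this seem more complicated than it is. The code actually builds a term-document matrix, so the terms are the rows.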

Problem solved:

    d_sel <- dtm[c('no wonderful', 'there were'),]
    inspect(d_sel)

                Docs
                Terms          10 11 12 13
                no wonderful    1  0  0  0
                there were      0  0  1  0
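
For anyone who lands here with a real DTM, or with a keep list that may contain terms missing from the matrix, here is a minimal sketch of the same idea. It rests on several assumptions that are not part of the original post: tm >= 0.7 (where, as far as I know, DataframeSource() expects doc_id and text columns), a bigram tokenizer built on NLP::ngrams() in place of RWeka, and the names corp, keep and keep_found, which are only illustrative. The matrix stays sparse throughout, so nothing is converted with as.matrix().

```r
library(tm)  # also attaches NLP, which provides ngrams()

# Same toy data as in the question. The doc_id/text column names are an
# assumption about tm >= 0.7; older versions used readTabular() instead.
dd <- data.frame(
  doc_id = as.character(10:13),
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

corp <- VCorpus(DataframeSource(dd))

# Bigram tokenizer based on NLP::ngrams(), used here instead of RWeka
# so that the sketch does not need Java.
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

ctrl <- list(tokenize = BigramTokenizer, removePunctuation = TRUE)

tdm <- TermDocumentMatrix(corp, control = ctrl)  # terms are rows
dtm <- DocumentTermMatrix(corp, control = ctrl)  # terms are columns

# Hypothetical keep list; the last entry does not occur in the corpus.
keep <- c("no wonderful", "there were", "not present")

# Drop requested terms that are absent, so character indexing cannot fail.
keep_found <- intersect(keep, Terms(tdm))

tdm_sel <- tdm[keep_found, ]   # row subset for a term-document matrix
dtm_sel <- dtm[, keep_found]   # column subset for a document-term matrix

# Alternative: restrict the vocabulary while the matrix is being built,
# using the dictionary control option.
tdm_dict <- TermDocumentMatrix(corp,
                               control = c(ctrl, list(dictionary = keep)))

inspect(tdm_sel)
```

If the keep list is known before the matrix is built, the dictionary variant may be the better fit for a very large corpus, because the unwanted bigrams never enter the matrix in the first place.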
