Keep only certain bigrams in a document-term matrix in R
Question: How can I keep only the bigram "no wonderful", or only a list of bigrams (Terms) of my choosing, in the document-term matrix?
I would like to apply this to a very large document-term matrix. I tried converting it to a regular matrix, but the vector size exceeds 1000 Gb.
Code:
dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)
library(tm)
library(RWeka)
# map data frame columns to document content and id
# (readTabular() is only available in older tm releases)
myReader <- readTabular(mapping = list(content = "text", id = "id"))
# create a VCorpus
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))
# n-gram tokenizer (min = max = 2 gives bigrams only)
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# create the matrix using Tokenizer
# (this is a TermDocumentMatrix: terms are rows, documents are columns)
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)
Output:
                Docs
Terms            10 11 12 13
  at the          0  0  0  1
  but there       0  0  1  0
  cases such      0  1  0  0
  even at         0  0  0  1
  in many         0  1  0  0
  many cases      0  1  0  0
  no wonderful    1  0  0  0
  not even        0  0  0  1
  other and       0  0  1  0
  so that         0  1  0  0
  still other     0  0  1  0
  such a          0  1  0  0
  that ever       1  0  0  0
  that in         0  1  0  0
  the rationale   0  0  0  1
  then that       1  0  0  0
  there were      0  0  1  0
  were still      0  0  1  0
  wonderful then  1  0  0  0
I had assumed this would be more complicated because it is a DTM, but the matrix can be subset by term name directly.
Problem solved:
d_sel <- dtm[c('no wonderful', 'there were'),]
inspect(d_sel)
              Docs
Terms          10 11 12 13
  no wonderful  1  0  0  0
  there were    0  0  1  0
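If the list of bigrams to keep might contain terms that never occur in the corpus, indexing by name alone may fail on the missing ones. Here is a minimal sketch of one way to guard against that, assuming the tm package is installed: intersect the wanted list with Terms() before subsetting. The toy corpus, the pure-R bigram_tokenizer (used in place of RWeka's NGramTokenizer so that no Java is needed) and the names keep, present and tdm_sel are illustrative and not part of the original post.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Toy corpus (illustrative), lower-cased and stripped of punctuation
docs <- c("No wonderful, then, that ever",
          "But there were still other and")
corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Pure-R bigram tokenizer; an assumption standing in for NGramTokenizer
bigram_tokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)

tdm <- TermDocumentMatrix(corp, control = list(tokenize = bigram_tokenizer))

# Bigrams we would like to keep; the last one does not occur in the corpus
keep <- c("no wonderful", "there were", "not present")

# Keep only the wanted terms that actually exist in the matrix
present <- intersect(keep, Terms(tdm))
tdm_sel <- tdm[present, ]

inspect(tdm_sel)
```

Because the subset is taken on the sparse object itself, nothing is converted to a dense matrix until as.matrix() is called on the small result, which should avoid the memory problem described in the question.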