
Converting matrix back into a DocumentTermMatrix

I'm very new to text mining, but I want to analyze tweets over a period of time.

I scraped tweets from Twitter weeks ago and am only now getting around to analyzing them. I saved the DocumentTermMatrix as a matrix and am having difficulty converting it back to a DocumentTermMatrix so that I can run latent Dirichlet allocation (LDA) on the data.

scrap<- searchTwitter("#RepealThe8th", n=1500)
twscrap <- sapply(scrap, function(x) x$getText())
corpus1 <- Corpus(VectorSource(twscrap))
corpus1 <- tm_map(corpus1,
              content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
              mc.cores=1)

corpus1 <- tm_map(corpus1, content_transformer(tolower), mc.cores=1)
corpus1 <- tm_map(corpus1, removePunctuation, mc.cores=1)
corpus1 <- tm_map(corpus1, function(x)removeWords(x,stopwords()), mc.cores=1)
corpus1 <- tm_map(corpus1, stemDocument, mc.cores=1)

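# Extra stopwords to drop on top of the defaults.
# setdiff() is safe when "r" is absent; x[-which(x == "r")] would return
# an empty vector in that case, because which() yields integer(0).
myStopwords <- setdiff(c("https", "http"), "r")
corpus1 <- tm_map(corpus1, removeWords, myStopwords)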

corpus1 <- tm_map(corpus1, stripWhitespace) 
plaincorpus1 <- tm_map(corpus1, PlainTextDocument)
# wordLengths is the control option recognised by current tm versions (minimum 3 characters)
dtm <- DocumentTermMatrix(plaincorpus1, control = list(wordLengths = c(3, Inf)))
m <- as.matrix(dtm)

This is how I originally saved the data:

write.csv(m, "matrix.csv")

When I load the data back in, I can't get it back into DTM form:

m <- read.csv("matrix.csv", header = TRUE)
corpNR <- Corpus(DataframeSource(m))
dtmNR <- DocumentTermMatrix(corpNR)
dtmNR$dimnames$Terms <- colnames(m)  # add terms to the DocumentTermMatrix
str(dtmNR)
dtmNR$ncol <- length(dtmNR$dimnames$Terms)  # give it the right number of columns

This gives me a DTM of the right size, but I'm not sure how to get the correct data for dtmNR$i, dtmNR$j or dtmNR$v.

I also tried:

library(qdap)
m1 <- as.Corpus(m)
#Error in data.frame(grouping, text.var, check.names = FALSE, stringsAsFactors = FALSE) : 
#  arguments imply differing number of rows: 2062, 1500
#dtm1 <- as.DocumentTermMatrix(m1)

dtm1 <- as.TermDocumentMatrix(m1)
#Error in .TermDocumentMatrix(t(x), weighting) : 
#  argument "weighting" is missing, with no default

Don't write it out to a csv file like that.

Instead, use save(dtm, file='myDTM.RData') or similar, and load('myDTM.RData') it later. load() restores the object under its original name, dtm, with its class and sparse structure intact.
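
For anyone in the same position, here is a minimal sketch of both routes: saving the DTM object directly, and rebuilding a DTM from a CSV that has already been written. It is an illustration rather than a definitive fix. The toy documents and temporary file paths are made up for the example, and saveRDS()/readRDS() is used as one of the "or similar" options in place of save()/load(). The rebuild step assumes the CSV was produced with write.csv(as.matrix(dtm), ...), so that the first column holds the row names and the remaining columns are raw term counts.

```r
library(tm)

# Toy corpus standing in for the cleaned tweets (hypothetical data).
docs <- c("repeal the eighth", "vote to repeal", "eighth amendment vote")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Route 1 (preferred): serialise the object itself, so the class and the
# sparse i/j/v structure survive the round trip.
rds_path <- tempfile(fileext = ".rds")
saveRDS(dtm, rds_path)
dtm_rds <- readRDS(rds_path)

# Route 2 (recovery): the DTM was already flattened to a CSV.
csv_path <- tempfile(fileext = ".csv")
write.csv(as.matrix(dtm), csv_path)

# check.names = FALSE keeps the term names exactly as written.
raw <- read.csv(csv_path, check.names = FALSE)

# Drop the first column (the saved row names) and assign fresh document ids,
# which also works if the saved ids were not unique.
m <- as.matrix(raw[, -1, drop = FALSE])
rownames(m) <- as.character(seq_len(nrow(m)))

# The default conversion method needs an explicit weighting;
# weightTf keeps the raw term counts.
dtm_csv <- as.DocumentTermMatrix(m, weighting = weightTf)
```

Passing weighting = weightTf is what avoids the 'argument "weighting" is missing' error shown in the question, and raw term-frequency counts are the form that count-based topic models such as LDA expect. The i, j and v components do not need to be filled in by hand, since the conversion builds them from the dense matrix.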


 