
Using a fitted latent semantic analysis (LSA) model to score new data in R

I am running latent semantic analysis (LSA) with textmineR in R. What I want is the document-by-topic matrix, with topic scores for each document, which I can get from the theta element of my LSA model (below). However, I am having trouble using the fitted model to score a new dataset (i.e., a new document-term matrix, DTM) so that I can apply my existing topic structure to new data. In the example below I fit two topics, and when I then call predict on the very same DTM (pretending it is a new file for the sake of this example), I get the following error:

"Error in predict.lsa_topic_model(model, dtm_m) : newdata must be a matrix of class dgCMatrix or a numeric vector"

I need to use an LSA model to score new text. Is there an easy fix that I'm missing? I haven't had any luck coercing the matrix to a "dgCMatrix", and I don't know how to do this with other packages such as lsa either. Any help with this approach would be greatly appreciated.

file = as.data.frame(matrix( c('case1', 'this is some SAMPLE TEXT!',
'case2',  'and this is the 2nd version of that text...', 
'case3', 'more stuff to talk about'), 
        nrow=3,              
        ncol=2,              
        byrow = TRUE))
names(file) <- c('doc_id', 'text')

library(tm)
library(RWeka)  # provides NGramTokenizer() and Weka_control()
wordCorpus <- Corpus(DataframeSource(file))

cleaner <- function (wordCorpus) {
  wordCorpus <- tm_map(wordCorpus, removeNumbers)
  wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
  wordCorpus <- tm_map(wordCorpus, removePunctuation)
  return (wordCorpus)
}
wordCorpus <- cleaner (wordCorpus)

tokenizer <- function(x) 
  NGramTokenizer(x, Weka_control(min = 1, max = 2))
dtm  <- DocumentTermMatrix (wordCorpus, control = list (tokenize=tokenizer, weighting = weightTfIdf))
dtm_m <- as.matrix(dtm)

library(textmineR)
model <- FitLsaModel(dtm = dtm_m,  k = 2)

# this is what I want to get, but ideally I would also
# be able to save the "model" object and use it to create this for a new sample

values <- as.data.frame (model$theta)
values
# pretending my original dataset is a new sample and using predict;
# this is the call that raises the error quoted above
values_other <- predict(model, dtm_m)
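
The error message suggests that predict wants a sparse matrix of class dgCMatrix, whereas as.matrix(dtm) produces a dense base-R matrix. Below is a minimal sketch of one way to do the coercion, assuming a reasonably recent version of the Matrix package. The toy matrix dense_dtm is a hypothetical stand-in for dtm_m, not part of the original code.

```r
library(Matrix)

# Hypothetical toy matrix standing in for as.matrix(dtm):
# rows are documents, columns are terms
dense_dtm <- matrix(
  c(1, 0, 2,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("case1", "case2"), c("sample", "text", "version"))
)

# Coerce step by step to a double-valued, general, column-compressed
# sparse matrix, which is what the class dgCMatrix denotes
sparse_dtm <- as(as(as(dense_dtm, "dMatrix"), "generalMatrix"), "CsparseMatrix")

class(sparse_dtm)
```

With the real objects, the same coercion applied to dtm_m should give something that predict(model, ...) accepts, provided the column names still match the vocabulary the model was fitted on. A tm DocumentTermMatrix can probably also be converted without the dense detour via Matrix::sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v, dims = dim(dtm), dimnames = dimnames(dtm)), since it stores its entries as triplets.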

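For workflows like this, you can pretty safely skip tm altogether and use textmineR's CreateDtm function directly. It returns a sparse matrix of class dgCMatrix, which is the class the error message from predict is asking for.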

See the LSA example in textmineR's topic modeling vignette, which shows this exact workflow: https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html
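
A rough sketch of what that could look like for the example in the question, under a few assumptions: the new_docs vector is made up for illustration, CreateDtm's default stop-word removal is left on (unlike the tm pipeline above), and the tf-idf weighting from the question is omitted to keep it short (the vignette shows how to apply it).

```r
library(textmineR)

# Same three documents as in the question, as a named character vector
docs <- c(
  case1 = "this is some SAMPLE TEXT!",
  case2 = "and this is the 2nd version of that text...",
  case3 = "more stuff to talk about"
)

# CreateDtm handles lowercasing, punctuation and number removal itself
# and builds unigrams plus bigrams when ngram_window = c(1, 2)
dtm <- CreateDtm(
  doc_vec = docs,
  doc_names = names(docs),
  ngram_window = c(1, 2),
  verbose = FALSE
)

# Fit a two-topic LSA model; theta holds the document-by-topic scores
model <- FitLsaModel(dtm = dtm, k = 2)
values <- as.data.frame(model$theta)

# Hypothetical new documents, run through the same preprocessing
new_docs <- c(
  new1 = "another version of the sample text",
  new2 = "stuff to talk about"
)
new_dtm <- CreateDtm(
  doc_vec = new_docs,
  doc_names = names(new_docs),
  ngram_window = c(1, 2),
  verbose = FALSE
)

# Score the new documents against the existing topics
theta_new <- predict(model, new_dtm)
```

Because the fitted model is an ordinary R object, saveRDS(model, "lsa_model.rds") and readRDS("lsa_model.rds") should be enough to reuse it in a later session (the file name is just an example).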
