
R topic modeling: labeling the topics of an LDA model

I used LDA to build a topic model for two text documents, A and B. Document A is closely related to computer science, and document B is closely related to geo-science. I then trained an LDA model with this code:

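     library(tm)          # Corpus, tm_map, DocumentTermMatrix
     library(topicmodels) # LDA, posterior

     text <- c(A, B) # the two training documents introduced above
     r <- Corpus(VectorSource(text)) # create corpus object
     r <- tm_map(r, content_transformer(tolower)) # convert all text to lower case
     r <- tm_map(r, removePunctuation)
     r <- tm_map(r, removeNumbers)
     r <- tm_map(r, removeWords, stopwords("english"))
     # LDA() expects documents in rows, so build a DocumentTermMatrix
     r.dtm <- DocumentTermMatrix(r, control = list(wordLengths = c(3, Inf)))
     my_lda <- LDA(r.dtm, 2)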

Now I want to use my_lda to predict the topic of a new document, C, and see whether it is related to computer science or geo-science. I know I can use this code for prediction:

     x <- C # a new document (a long string) introduced above for prediction
     rp <- Corpus(VectorSource(x)) # create corpus object
     rp <- tm_map(rp, content_transformer(tolower)) # convert all text to lower case
     rp <- tm_map(rp, removePunctuation)
     rp <- tm_map(rp, removeNumbers)
     rp <- tm_map(rp, removeWords, stopwords("english"))
     # documents in rows again, matching the layout LDA() was trained on
     rp.dtm <- DocumentTermMatrix(rp, control = list(wordLengths = c(3, Inf)))
     test.topics <- posterior(my_lda, rp.dtm)

It gives me a label of 1 or 2, and I have no idea what 1 or 2 represents. How can I tell whether it means computer science related or geo-science related?

You can extract the most likely terms from your LDA topic model and replace those black-box numeric names with as many of those terms as you like. Your example isn't reproducible, but here is an example illustrating how you can do this:

> library(topicmodels)
> data(AssociatedPress)
> 
> train <- AssociatedPress[1:100]
> test <- AssociatedPress[101:150]
> 
> train.lda <- LDA(train,2)
> 
> #returns those black box names
> test.topics <- posterior(train.lda,test)$topics
> head(test.topics)
              1           2
[1,] 0.57245696 0.427543038
[2,] 0.56281568 0.437184320
[3,] 0.99486888 0.005131122
[4,] 0.45298547 0.547014530
[5,] 0.72006712 0.279932882
[6,] 0.03164725 0.968352746
> #extract top 5 terms for each topic and assign as variable names
> colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",")
> head(test.topics)
     percent,year,i,new,last new,people,i,soviet,states
[1,]              0.57245696                0.427543038
[2,]              0.56281568                0.437184320
[3,]              0.99486888                0.005131122
[4,]              0.45298547                0.547014530
[5,]              0.72006712                0.279932882
[6,]              0.03164725                0.968352746
> #round to one topic if you'd prefer
> test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)])
> head(test.topics)
[1] "percent,year,i,new,last"    "percent,year,i,new,last"    "percent,year,i,new,last"   
[4] "new,people,i,soviet,states" "percent,year,i,new,last"    "new,people,i,soviet,states"
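
If you would rather use short names of your own than the pasted top terms, the same idea can be sketched in base R. This is only an illustration: the probabilities are the six rows printed above, while the labels "economy" and "world affairs" and the helper `label_documents` are hypothetical names made up for this example, not something the model or the topicmodels package provides.

```r
# Posterior topic probabilities for six test documents, copied from the
# output above (columns are the model's topic numbers 1 and 2).
test.topics <- matrix(
  c(0.57245696, 0.427543038,
    0.56281568, 0.437184320,
    0.99486888, 0.005131122,
    0.45298547, 0.547014530,
    0.72006712, 0.279932882,
    0.03164725, 0.968352746),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("1", "2"))
)

# ASSUMPTION: hand-picked names, chosen by a person after reading the
# top terms of each topic. The model itself knows nothing about them.
topic.labels <- c("1" = "economy", "2" = "world affairs")

# Hypothetical helper: for each document, return the most likely topic
# number, its hand-picked label, and the probability of that topic.
label_documents <- function(topic.probs, labels) {
  best <- max.col(topic.probs, ties.method = "first")
  topic <- colnames(topic.probs)[best]
  data.frame(
    topic = topic,
    label = unname(labels[topic]),
    probability = topic.probs[cbind(seq_len(nrow(topic.probs)), best)],
    stringsAsFactors = FALSE
  )
}

labelled <- label_documents(test.topics, topic.labels)
print(labelled)
```

Keeping the probability next to the label makes it easier to spot documents that sit close to 50/50, such as rows 1, 2 and 4 above. For the two training documents in the question, `topics(my_lda)` should show which topic number A and B were each assigned, which is one way to decide which number to call computer science and which geo-science. With only two training documents the split may not be clean, and the topic numbering can change between runs unless a seed is fixed.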
