简体   繁体   中英

Topic label of each document in LDA model using textmineR

I'm using textmineR to fit a LDA model to documents similar to https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html . Is it possible to get the topic label for each document in the data set?

>library(textmineR)
>data(nih_sample)
> # create a document term matrix 
> dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,doc_names = 
 nih_sample$APPLICATION_ID, ngram_window = c(1, 2), stopword_vec = 
 c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")),lower 
 = TRUE, remove_punctuation = TRUE,remove_numbers = TRUE, verbose = FALSE, 
 cpus = 2) 
 >dtm <- dtm[,colSums(dtm) > 2]
 >set.seed(123)
 > model <- FitLdaModel(dtm = dtm, k = 20,iterations = 200,burnin = 
 180,alpha = 0.1, beta = 0.05, optimize_alpha = TRUE, calc_likelihood = 
 TRUE,calc_coherence = TRUE,calc_r2 = TRUE,cpus = 2)

then adding the labels to the model:

 > model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, 
   M = 1)

now I want the topic labels for each of 100 document in nih_sample$ABSTRACT_TEXT

Are you looking to label each document by the label of its most prevalent topic? IF so, this is how you could do it:

# convert labels to a data frame so we can merge 
label_df <- data.frame(topic = rownames(model$labels), label = model$labels, stringsAsFactors = FALSE)

# get the top topic for each document
top_topics <- apply(model$theta, 1, function(x) names(x)[which.max(x)][1])

# convert the top topics for each document so we can merge
top_topics <- data.frame(document = names(top_topics), top_topic = top_topics, stringsAsFactors = FALSE)

# merge together. Now each document has a label from its top topic
top_topics <- merge(top_topics, label_df, by.x = "top_topic", by.y = "topic", all.x = TRUE)

This kind of throws away some information that you'd get from LDA though. One advantage of LDA is that each document can have more than one topic. Another is that we can see how much of each topic is in that document. You can do that here by

# set the plot margins to see the labels on the bottom
par(mar = c(8.1,4.1,4.1,2.1))

# barplot the first document's topic distribution with labels
barplot(model$theta[1,], names.arg = model$labels, las = 2)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM