
Labeled LDA + Guided LDA topic modelling

I am rather new to machine learning, NLP, and LDA, so I'm not sure I'm even approaching my problem correctly; but I am attempting to do unsupervised topic modelling with known topics and multiple topic selections per document. Based on Topic modelling, but with known topics?

I can label every single one of my documents with every single topic, and my unsupervised set effectively becomes supervised (LLDA is a supervised technique).

Reading this paper, I've come across some other potential issues. First, my data is organized into categories and sub-categories. According to the paper, LLDA is more effective when there is significant semantic distinction between texts, which I won't particularly have with my relatively close sub-categories. Additionally, the paper notes that LLDA was not designed to be a multi-label classifier.

I'm hoping to remedy these weaknesses by including the guided part of GuidedLDA (I haven't read a paper on this, but I did read https://medium.freecodecamp.org/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164 ).
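For context, the seeding step described in that article boils down to mapping chosen seed words to the topics they should anchor before fitting the model. A minimal sketch of that mapping step; the vocabulary, seed lists, and the commented-out `guidedlda` calls are illustrative assumptions, not code from the article:

```python
# GuidedLDA expects its seeds as a dict {word_id: topic_id}.
# The vocabulary and seed words below are made up for illustration.
word2id = {"goal": 0, "team": 1, "stock": 2, "market": 3, "election": 4}

seed_topic_list = [
    ["goal", "team"],      # topic 0: sports
    ["stock", "market"],   # topic 1: finance
]

seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        if word in word2id:                  # skip out-of-vocabulary seeds
            seed_topics[word2id[word]] = topic_id

# With the guidedlda package, this dict would then be passed to fit, e.g.:
#   model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7)
#   model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```

The `seed_confidence` knob is what makes this "guided" rather than hard-labelled: seeds bias the initial assignments rather than fixing them.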

So is there any algorithm (I would assume a modification of LLDA, but again I'm not super well read in this area) that allows one to use some form of intuition to aid an unsupervised topic-model with known topic classes that selects multiple topics?

As for why I don't just use GuidedLDA: I am planning to test it out and see how well it does (alongside LLDA), but it's also not designed for multiple labels.

Slight note, if it matters: I am actually using documents and words for my data; I've read about LDA being used with other data types.

Further note: I have a fair amount of experience with Python. I've also heard there is a good topic-modelling tool called MALLET that I might explore but have yet to look into (maybe it has something for this?).

Since you said you would try out GuidedLDA, you can get multiple labels in the following way:

GuidedLDA outputs a document-topic distribution (often called theta): for each document, an array with the probability of each topic. We usually take the topic with the highest probability, but you can instead set a threshold appropriate to your problem and select every topic whose probability exceeds it.
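A minimal sketch of that thresholding idea; the matrix here is a made-up stand-in for whatever `model.transform(X)` returns in your setup:

```python
import numpy as np

# Hypothetical document-topic matrix (theta): one row per document,
# one column per topic, each row summing to 1.
doc_topic = np.array([
    [0.50, 0.35, 0.10, 0.05],
    [0.90, 0.04, 0.03, 0.03],
    [0.30, 0.30, 0.25, 0.15],
])

threshold = 0.25  # tune this per problem

# Multi-label assignment: every topic whose probability clears the threshold.
labels = [np.where(row >= threshold)[0].tolist() for row in doc_topic]
print(labels)  # [[0, 1], [0], [0, 1, 2]]
```

Note the trade-off: a low threshold gives noisy extra labels, a high one collapses back to single-label argmax behaviour.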

This would help you in solving your unsupervised guided Topic modelling with multiple labels problem.

Because you have a set of known topics, it would make sense to use a supervised LDA/LLDA. If you use an unsupervised LDA and label all of the documents with the known topics, it would find associations between the given documents, but they likely wouldn't correlate with the given topics.

I've been creating supervised LDA models with MALLET and Python. Gensim has a wrapper for MALLET's LDA class, but I've had better luck using Python's subprocess module to run MALLET through the command line. I used David Mimno's post as a starting place.
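A sketch of that subprocess approach, assuming the `mallet` binary is on your PATH; the file names are illustrative placeholders:

```python
import subprocess

# Build the two MALLET command-line invocations: first import the corpus
# into MALLET's binary format, then train the topic model.
import_cmd = [
    "mallet", "import-file",
    "--input", "docs.tsv",          # one doc per line: id<TAB>label<TAB>text
    "--output", "docs.mallet",
    "--keep-sequence",              # required for topic modelling
]
train_cmd = [
    "mallet", "train-topics",
    "--input", "docs.mallet",
    "--num-topics", "20",
    "--output-doc-topics", "doc_topics.txt",   # per-doc topic proportions
    "--output-topic-keys", "topic_keys.txt",   # top words per topic
]

# Uncomment to actually run (requires MALLET installed and docs.tsv present):
# subprocess.run(import_cmd, check=True)
# subprocess.run(train_cmd, check=True)
```

The `doc_topics.txt` output gives you the per-document topic proportions, so the same thresholding trick from the other answer applies to it for multi-label assignment.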

You can have multiple labels for a document; the beauty of LDA is that it's almost like a fuzzy nearest-neighbours association. The sub-categories shouldn't be a problem for LDA, because a document can have an association with the parent topic and with sub-topics, and those associations don't need to be evenly distributed. In that sense it is very much a multi-label classifier.

If you really want to move away from LDA for processing documents, I would recommend an RNN, a recurrent neural network. It is particularly useful for text/document processing because it looks for associations in sequences of data.
