Select texts by topic (LDA)

Would it be possible to look for texts that are within a certain topic (determined by LDA)?

I have a list of 5 topics with 10 words each, found using LDA.

I have analysed the texts in a dataframe's column. I would like to select/filter rows/texts that are in one specific topic.

If you need more information, I can provide it.

What I am referring to is the step that returns this output:

[(0,
  '0.207*"house" + 0.137*"apartment" + 0.118*"sold" + 0.092*"beach" + '
  '0.057*"kitchen" + 0.049*"rent" + 0.033*"landlord" + 0.026*"year" + '
  '0.024*"bedroom" + 0.023*"home"'),
 (1,
  '0.270*"school" + 0.138*"homeworks" + 0.117*"students" + 0.084*"teacher" + '
  '0.065*"pen" + 0.038*"books" + 0.022*"maths" + 0.020*"exercise" + '
  '0.020*"friends" + 0.020*"college"'),
 ... ]

created by

# LDA Model
import gensim

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           # alpha=[0.01]*num_topics,
                                           per_word_topics=True,
                                           eta=[0.01]*len(id2word.keys()))

# Print the keywords of each topic

from pprint import pprint
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

The original column with texts that have been analysed is called Texts and it looks like:

Texts 

"Children are happy to go to school..."
"The average price for buying a house is ... "
"Our children love parks so we should consider to buy an apartment nearby"

etc etc...

My expected output would be

Texts                                                                         Topic
"Children are happy to go to school..."                                       2
"The average price for buying a house is ... "                                1
"Our children love parks so we should consider to buy an apartment nearby"    2

Thanks

doc_lda contains one entry per text. By default each entry is a list of (topic, score) tuples; because the model above is built with per_word_topics=True, each entry is instead a tuple whose first element is that list. You can then assign a topic to each text using any heuristic. A simple one is to pick the topic with the highest score.
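As an illustration of a slightly stricter heuristic, the sketch below assigns the highest-scoring topic only when its score reaches a cutoff, and leaves the text unassigned otherwise. The function name assign_topic, the min_score parameter and the sample scores are hypothetical; only the (topic_id, score) format comes from gensim.

```python
# Hypothetical per-document output in gensim's (topic_id, score) format;
# the numbers are made up for illustration.
doc_topics = [
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.85), (1, 0.15)],
    [(0, 0.40), (1, 0.35), (2, 0.25)],
]


def assign_topic(topics, min_score=0.5):
    """Return the id of the highest-scoring topic, or None if no topic reaches min_score."""
    if not topics:
        return None
    topic_id, score = max(topics, key=lambda pair: pair[1])
    return topic_id if score >= min_score else None


assigned = [assign_topic(topics) for topics in doc_topics]
print(assigned)  # [1, 0, None]
```

The third text stays unassigned because no topic clearly dominates it; lowering min_score makes the assignment more permissive.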

We can extract the topic scores of each sentence by doing this:

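# get_document_topics returns plain (topic, score) lists even when the model uses per_word_topics=True;
# minimum_probability=0.0 keeps low-scoring topics, so every row lists all topics in order
topic_scores = [[score for _, score in lda_model.get_document_topics(bow, minimum_probability=0.0)] for bow in corpus]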

You can also convert the scores into a pandas DataFrame in which each row is a text and each column is a topic id (this assumes every row lists all topics, in order). A DataFrame makes more complex operations on the text-topic scores easier:

df_topics = pd.DataFrame(topic_scores)
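By default gensim drops topics whose score falls below minimum_probability, so the per-text lists can have different lengths. A minimal sketch of building the DataFrame so that each column is still labelled by its topic id, with missing topics filled in as 0.0; the doc_topics values and num_topics here are made up for illustration.

```python
import pandas as pd

num_topics = 3  # assumed number of topics in the model

# Hypothetical (topic_id, score) lists, one per text.
doc_topics = [
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.85), (1, 0.15)],  # topic 2 was below the threshold and omitted
    [(1, 0.30), (2, 0.70)],  # topic 0 omitted
]

# One dict per text maps topic id -> score; reindex guarantees one column
# per topic id, and fillna replaces the omitted scores with 0.0.
df_topics = (
    pd.DataFrame([dict(topics) for topics in doc_topics])
    .reindex(columns=range(num_topics))
    .fillna(0.0)
)

# Example operation: rows where topic 0 scores above 0.5.
high_topic0 = df_topics[df_topics[0] > 0.5]
print(high_topic0)
```

Filtering on a score column like this selects texts by how strongly they belong to a topic, rather than only by their single best topic.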

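If you just want to assign each text the single topic with the highest score, you can do this:

# get_document_topics returns plain (topic, score) lists even when the model uses per_word_topics=True
max_topics = [max(lda_model.get_document_topics(bow), key=lambda x: x[1])[0] for bow in corpus]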
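To get the output asked for in the question, the topic ids can be stored as a new column and used to filter rows. A minimal sketch, assuming the corpus was built from the Texts column in the same row order; the max_topics values below are hypothetical stand-ins for the list computed above, and gensim topic ids start at 0.

```python
import pandas as pd

df = pd.DataFrame({"Texts": [
    "Children are happy to go to school...",
    "The average price for buying a house is ... ",
    "Our children love parks so we should consider to buy an apartment nearby",
]})

# Hypothetical dominant-topic ids, one per row of df
# (0 = housing topic, 1 = school topic, as in the printed topics).
max_topics = [1, 0, 0]

# Attach the topic id to each text, then keep only the rows of one topic.
df["Topic"] = max_topics
selected = df[df["Topic"] == 0]
print(selected)
```

To keep several topics at once, df[df["Topic"].isin([0, 1])] works the same way.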

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 