Select texts by topic (LDA)

Would it be possible to look for texts that belong to a certain topic (as determined by LDA)?

I have a list of 5 topics with 10 words each, found using LDA.

I have analysed the texts in a dataframe column, and I would like to select/filter the rows/texts that belong to one specific topic.

If you need more information, I will provide it.

What I am referring to is the step that returns this output:

[(0,
  '0.207*"house" + 0.137*"apartment" + 0.118*"sold" + 0.092*"beach" + '
  '0.057*"kitchen" + 0.049*"rent" + 0.033*"landlord" + 0.026*"year" + '
  '0.024*"bedroom" + 0.023*"home"'),
 (1,
  '0.270*"school" + 0.138*"homeworks" + 0.117*"students" + 0.084*"teacher" + '
  '0.065*"pen" + 0.038*"books" + 0.022*"maths" + 0.020*"exercise" + '
  '0.020*"friends" + 0.020*"college"'),
 ... ]

created by:

# LDA Model
import gensim

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           # alpha=[0.01]*num_topics,
                                           per_word_topics=True,
                                           eta=[0.01]*len(id2word.keys()))

Print the top keywords of each topic:

from pprint import pprint
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

The original column with the analysed texts is called Texts, and it looks like this:

Texts 

"Children are happy to go to school..."
"The average price for buying a house is ... "
"Our children love parks so we should consider to buy an apartment nearby"

etc etc...

My expected output would be:

Texts                                                                         Topic
"Children are happy to go to school..."                                       2
"The average price for buying a house is ... "                                1
"Our children love parks so we should consider to buy an apartment nearby"    2

Thanks

doc_lda contains a list of (topic, score) tuples for each sentence. You can therefore assign a topic to each sentence with whatever heuristic you like; a simple one is to pick the topic with the highest score.
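One thing worth checking first: the model in the question is built with per_word_topics=True, and in that case gensim returns a (document topics, word topics, phi values) tuple for each document instead of a bare list of (topic, score) pairs. Below is a minimal sketch of normalising either shape. The values are made up to stand in for real model output, and doc_topic_pairs is a hypothetical helper name, not a gensim function.

```python
def doc_topic_pairs(doc_result):
    """Return the list of (topic_id, score) pairs for one document.

    Accepts either a plain list of pairs or a
    (doc_topics, word_topics, phi_values) tuple.
    """
    if isinstance(doc_result, tuple):
        # per_word_topics=True: the document-level pairs are the first element.
        return doc_result[0]
    return doc_result


# Made-up values standing in for two items of doc_lda.
plain_item = [(0, 0.12), (1, 0.88)]
tuple_item = ([(0, 0.91), (1, 0.09)], [(3, [0])], [(3, [(0, 1.0)])])

print(doc_topic_pairs(plain_item))
print(doc_topic_pairs(tuple_item))
```

Applied to the real data, this would look like `doc_topics = [doc_topic_pairs(item) for item in doc_lda]`, after which the snippets that follow can iterate over doc_topics rather than doc_lda.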

We can extract the topic scores of each sentence like this:

topic_scores = [[topic_score[1] for topic_score in sent] for sent in doc_lda]

You can also convert the above into a pandas dataframe in which each row is a sentence and each column is a topic id. A dataframe makes more complex operations on the sentence-to-topic scores easier.

import pandas as pd
df_topics = pd.DataFrame(topic_scores)
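The columns only line up with topic ids if every sentence has a score for every topic. gensim leaves out topics whose probability falls below the model's minimum_probability, so some rows may be shorter than others. Here is a minimal sketch of padding the missing topics with 0.0, using made-up scores; dense_topic_scores is a hypothetical helper name.

```python
def dense_topic_scores(doc_topics, num_topics):
    """Turn sparse (topic_id, score) lists into rows of length num_topics."""
    rows = []
    for pairs in doc_topics:
        row = [0.0] * num_topics
        for topic_id, score in pairs:
            row[topic_id] = score  # place each score in its topic's column
        rows.append(row)
    return rows


# Made-up scores; the second document has no entry for topic 1.
doc_topics = [[(0, 0.10), (1, 0.85), (2, 0.05)], [(0, 0.70), (2, 0.30)]]
topic_scores = dense_topic_scores(doc_topics, num_topics=3)
print(topic_scores)  # [[0.1, 0.85, 0.05], [0.7, 0.0, 0.3]]
```

Passing these rows to pd.DataFrame then gives a frame in which column k always holds the score for topic k.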

If you just want to assign each sentence the single topic with the highest score, you can do this:

max_topics = [max(sent, key=lambda x: x[1])[0] for sent in doc_lda]
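To get back to the original question, the dominant topic can be paired with each text and used as a filter. The sketch below uses plain Python and made-up scores so that it runs without a trained model. It assumes the texts and the per-document topic lists are in the same order, and the topic ids follow the printed model output (0 for housing, 1 for school).

```python
texts = [
    "Children are happy to go to school...",
    "The average price for buying a house is ... ",
    "Our children love parks so we should consider to buy an apartment nearby",
]

# Made-up (topic_id, score) pairs, one list per text, in the same order.
doc_topics = [
    [(0, 0.08), (1, 0.92)],
    [(0, 0.95), (1, 0.05)],
    [(0, 0.67), (1, 0.33)],
]

# Dominant topic per text: the pair with the highest score.
max_topics = [max(pairs, key=lambda pair: pair[1])[0] for pairs in doc_topics]

# Pair each text with its topic, then keep only the texts in topic 0.
rows = [{"Texts": text, "Topic": topic} for text, topic in zip(texts, max_topics)]
selected = [row["Texts"] for row in rows if row["Topic"] == 0]

print(max_topics)  # [1, 0, 0]
print(selected)
```

With a pandas dataframe df holding the Texts column in corpus order, the equivalent would be `df["Topic"] = max_topics` followed by `df[df["Topic"] == 0]`.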
