
Topic wise document distribution in Gensim LDA

Is there a way in Python to map documents belonging to a certain topic? For example, a list of documents that are primarily "Topic 0". I know there are ways to list topics for each document, but how do I do it the other way around?

Edit:

I am using the following script for LDA:

    import os
    import textract
    import gensim
    from gensim import corpora
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    p_stemmer = PorterStemmer()

    # my_path and files are assumed to be defined elsewhere
    doc_set = []
    for file in files:
        newpath = os.path.join(my_path, file)
        newpath1 = textract.process(newpath)
        newpath2 = newpath1.decode("utf-8")
        doc_set.append(newpath2)

    texts = []
    for i in doc_set:
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        stopped_tokens = [i for i in tokens if i not in stopwords.words()]
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
        texts.append(stemmed_tokens)

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary, passes=1)

You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.

But you want the reverse: a list of documents, for a topic.

Essentially, you'll want to build the reverse-mapping yourself.

Fortunately, Python's native dicts and idioms for working with mappings make this pretty simple, just a few lines of code, as long as you're working with data that fully fits in memory.

Very roughly, the approach would be:

  • create a new structure (a dict or list) for mapping topics to lists of documents
  • iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
  • finally, look up (and perhaps sort) those lists of docs for each topic of interest
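The three steps above can be sketched in pure Python. The per-document topic lists below are hypothetical stand-ins for what a trained model's get_document_topics() would return per document:

```python
from collections import defaultdict

# Hypothetical per-document topic distributions: doc_id -> [(topic_id, score), ...]
# In practice these would come from ldamodel.get_document_topics(doc_bow).
doc_topics = {
    0: [(0, 0.9), (1, 0.1)],
    1: [(1, 0.8), (0, 0.2)],
    2: [(0, 0.6), (1, 0.4)],
}

# Step 1 & 2: build the reverse mapping, topic_id -> list of (doc_id, score)
docs_by_topic = defaultdict(list)
for doc_id, topics in doc_topics.items():
    for topic_id, score in topics:
        docs_by_topic[topic_id].append((doc_id, score))

# Step 3: sort each topic's docs so the strongest matches come first
for doc_list in docs_by_topic.values():
    doc_list.sort(key=lambda pair: pair[1], reverse=True)

print(docs_by_topic[0])  # [(0, 0.9), (2, 0.6), (1, 0.2)]
```

A defaultdict avoids having to know the number of topics up front; with a fixed num_topics, a plain list of lists (as in the code further below) works equally well.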

If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse mapping you'd need.

Update for your code update:

OK, if your model is in ldamodel and your BOW-formatted docs are in corpus, you'd do something like:

# setup: get the model's topics in their native ordering...
all_topics = ldamodel.print_topics()
# ...then create an empty list per topic to collect the docs:
docs_per_topic = [[] for _ in all_topics]

# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = ldamodel.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))

After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):

print(docs_per_topic[0])

If you're interested in the top docs per topic, you can further sort each list's pairs by their score:

for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)

Then, you could get the top-10 docs for topic 0 like:

print(docs_per_topic[0][:10])
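If instead you want each document assigned only to its single strongest topic (so "primarily Topic 0" means topic 0 has the highest score for that doc), a small helper does it. The commented usage at the end assumes the ldamodel and corpus variables from the code above:

```python
def dominant_topic(doc_topics):
    # doc_topics is a list of (topic_id, score) pairs for one document,
    # as returned by ldamodel.get_document_topics(doc_bow);
    # return the topic_id with the highest score
    return max(doc_topics, key=lambda pair: pair[1])[0]

# With a hypothetical topic distribution:
print(dominant_topic([(0, 0.3), (1, 0.7)]))  # 1

# With the trained model, the docs that are primarily topic 0 would then be:
# topic0_docs = [doc_id for doc_id, doc_bow in enumerate(corpus)
#                if dominant_topic(ldamodel.get_document_topics(doc_bow)) == 0]
```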

Note that this does everything using all-in-memory lists, which might become impractical for very large corpuses. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
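One minimal disk-backed variant, sketched with the standard-library sqlite3 module (the inserted rows here are hypothetical; in practice they would come from iterating over corpus with ldamodel.get_document_topics as above):

```python
import sqlite3

# Use a real file path instead of ":memory:" for an actual large corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_topics (topic_id INTEGER, doc_id INTEGER, score REAL)")

# Hypothetical (topic_id, doc_id, score) rows standing in for model output.
rows = [(0, 0, 0.9), (1, 0, 0.1), (0, 1, 0.2), (1, 1, 0.8)]
conn.executemany("INSERT INTO doc_topics VALUES (?, ?, ?)", rows)

# Top docs for topic 0, strongest first -- the sorting happens in the database,
# so the full per-topic lists never need to live in Python memory at once.
top_docs = conn.execute(
    "SELECT doc_id, score FROM doc_topics WHERE topic_id = 0 ORDER BY score DESC"
).fetchall()
print(top_docs)  # [(0, 0.9), (1, 0.2)]
```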
