Output format of LDA in Vowpal Wabbit

Question

I used VowpalWabbit.LDA to generate topics for a document collection. The output file looks like this:

Version 7.7.0

Min label:0.000000

Max label:1.000000

bits:18

0 pairs: 

0 triples: 

rank:0

lda:10

0 ngram: 

0 skip: 

options: --lda 10

0 21407.330078 1.025429 0.648226 0.917246 0.451278 0.801456 11463.415039 0.876181 1.105704 0.785956 

1 39210.687500 0.814911 0.389153 0.473620 0.391765 0.688513 0.708061 0.526936 0.719026 0.659338 

2 41573.523438 1.161345 0.583391 0.918144 0.318337 0.543920 0.704812 0.987455 0.633980 0.890918 

3 2.759077 1.114242 0.662993 1.113668 0.632519 0.707388 26730.898438 1.074518 0.974116 0.909262 

4 1.476383 1.263869 0.552380 0.838780 0.500615 0.529077 24156.128906 0.689529 1.400310 0.530180 

5 1.083310 0.746087 0.539263 1.152820 0.496213 0.726304 17391.972656 0.809698 1.682978 0.925061 

6 4.601943 1.551102 0.541617 1.532858 0.418091 1.432069 10.024081 1.992290 12924.787109 1.202141

I expected to see an identifier for each word and the probability of that word belonging to each topic. Instead I see huge numbers such as 21407.330078. Does anybody know how to transform this output into the format I want?

Answer 1

It seems you are looking at the predictions output file. It contains "the inferred per-document topic weights" in the following format: "Each line corresponds to a document d. Each column corresponds to a topic k".
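To read such weights as proportions, one option is to divide each line by its sum. The sketch below is only an illustration: it assumes that each line holds one whitespace-separated weight per topic, the helper name `topic_proportions` is hypothetical, and the input values are made up.

```python
def topic_proportions(line, num_topics):
    """Convert one line of per-document topic weights into proportions.

    Assumption: the first `num_topics` whitespace-separated tokens on the
    line are the topic weights for one document.
    """
    weights = [float(tok) for tok in line.split()[:num_topics]]
    if len(weights) != num_topics:
        raise ValueError("expected %d weights, got %d" % (num_topics, len(weights)))
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; cannot normalize")
    return [w / total for w in weights]


# Made-up example line with three topics.
print(topic_proportions("2.0 1.0 1.0", 3))  # [0.5, 0.25, 0.25]
```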

If you need information about words, add the "--readable_model topics.dat" parameter to the command line. This gives you the topics in human-readable format with the following content: "Each line corresponds to a topic k. Each column corresponds to a word w". Please refer to https://github.com/JohnLangford/vowpal_wabbit/wiki/lda.pdf

Answer 2

If you run vw with -a (audit), you can see the mappings from words to hash values. I assume you need to normalize each line in the model output and then find the top words for each topic.
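A minimal sketch of that normalization, assuming the layout shown in the question: a few header lines, then one row per word index followed by one weight per topic. The function names are hypothetical, and the sample reuses the first two data rows from the question. Mapping word indices back to words would still need the audit output and is not covered here.

```python
SAMPLE = """Version 7.7.0
lda:10
options: --lda 10
0 21407.330078 1.025429 0.648226 0.917246 0.451278 0.801456 11463.415039 0.876181 1.105704 0.785956
1 39210.687500 0.814911 0.389153 0.473620 0.391765 0.688513 0.708061 0.526936 0.719026 0.659338
"""


def parse_model_rows(text, num_topics):
    """Return {word_index: [weight per topic]}, skipping header lines.

    Assumption: a data row is an integer index followed by exactly
    `num_topics` floats; anything else is treated as header and ignored.
    """
    rows = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) != num_topics + 1:
            continue
        try:
            word_id = int(tokens[0])
            weights = [float(tok) for tok in tokens[1:]]
        except ValueError:
            continue
        rows[word_id] = weights
    return rows


def normalize(weights):
    """Scale a row so that its entries sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def top_words(rows, topic, n):
    """Return the n word indices with the largest weight for one topic."""
    ranked = sorted(rows, key=lambda word_id: rows[word_id][topic], reverse=True)
    return ranked[:n]


rows = parse_model_rows(SAMPLE, num_topics=10)
for word_id, weights in rows.items():
    shares = normalize(weights)
    print(word_id, " ".join("%.4f" % s for s in shares))
print("top word index for topic 0:", top_words(rows, 0, 1))
```

Normalizing a row gives the share of each topic for that word index, which explains the huge values: a weight such as 21407.330078 is unnormalized and simply dominates its row.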

Answer 1
2 2014-10-10 20:54:36

Answer 2
0 2015-05-11 23:59:48
