Can LDAvis analyse the results of vowpal_wabbit LDA?

LDAvis provides an excellent way of visualising and exploring topic models. LDAvis requires 5 parameters:

  1. phi (matrix with dimensions number of terms times number of topics)
  2. theta (matrix with dimensions number of documents times number of topics)
  3. number of words per document (integer vector)
  4. the vocabulary (character vector)
  5. the word frequency in the whole corpus (integer vector)
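
Items 3 to 5 do not depend on the fitted model and can be computed from the tokenised corpus alone. A minimal Python sketch, assuming the documents are already available as lists of tokens (the function name `corpus_statistics` and the toy documents are hypothetical):

```python
from collections import Counter


def corpus_statistics(tokenized_docs):
    """Return (doc_lengths, vocab, term_frequency) for a list of token lists."""
    # Item 3: number of words in each document.
    doc_lengths = [len(doc) for doc in tokenized_docs]
    # Count every token over the whole corpus.
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Item 4: the vocabulary, in a fixed (here alphabetical) order.
    vocab = sorted(counts)
    # Item 5: corpus-wide frequency of each term, in the same order as vocab.
    term_frequency = [counts[term] for term in vocab]
    return doc_lengths, vocab, term_frequency


# Toy corpus, for illustration only.
docs = [["topic", "model", "topic"], ["model", "fit"]]
doc_lengths, vocab, term_frequency = corpus_statistics(docs)
```

Whatever order is chosen for the vocabulary, the term dimension of phi has to follow the same order.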

The short version of my question is: after fitting an LDA model with vowpal wabbit, how does one derive phi and theta?

theta represents the mixture of topics per document, and must thus sum to 1 per document. phi represents the probability of a term given the topic, and must thus sum to 1 per topic.

After running LDA with vowpal wabbit (vw), some kind of weights are stored in a model. A human-readable version of that model can be acquired by feeding in a special file with one document per term in the vocabulary, while deactivating learning (with the -t parameter), e.g.

vw -t -i weights -d dictionary.vw --readable_model readable.model.txt

According to the documentation of vowpal wabbit, all columns except the first one of readable.model.txt now "represent the per-word topic distributions."
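
Reading that file into a terms-by-topics table could be sketched as follows in Python. The layout assumed here (a few header lines, then one line per term index starting at 0, each holding the index followed by one weight per topic) is an assumption and should be checked against the file your version of vw actually writes. The function name and the sample text are hypothetical.

```python
def parse_readable_model(text, num_terms):
    """Return one list of topic weights per term, dropping the index column."""
    rows = []
    in_data = False
    for line in text.splitlines():
        fields = line.split()
        if not in_data:
            # Assumed layout: header lines come first, data starts at index 0.
            if fields and fields[0] == "0" and ":" not in line:
                in_data = True
            else:
                continue
        if not fields:
            continue
        if int(fields[0]) >= num_terms:
            # Assumption: vw may write more rows than there are terms.
            break
        # Keep every column except the first one (the term index).
        rows.append([float(x) for x in fields[1:]])
    return rows


# Made-up file content with 2 topics, for illustration only.
sample = """Version 0.0.0
options: (hypothetical header)
0 0.1 2.0
1 3.0 0.5
2 0.2 0.2
3 0.1 0.1
"""
weights = parse_readable_model(sample, num_terms=3)
```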

You can also generate predictions with vw, i.e. for a collection of documents

vw -t -i weights -d some-documents.txt -p predictions.txt

Both predictions.txt and readable.model.txt have dimensions that reflect the number of inputs (rows) and the number of topics (columns), and neither of them is a probability distribution, because they do not sum to 1 (neither per row nor per column).
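
One possible way forward, assuming that both files hold nonnegative, unnormalised weights (an assumption, not something the vw documentation quoted above confirms), is to rescale them: each document row of the predictions to sum to 1 for theta, and each topic column of the term weights to sum to 1 for phi. A minimal Python sketch with made-up numbers:

```python
def normalise_rows(matrix):
    """Scale each row of nonnegative numbers so that it sums to 1."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total for value in row])
    return result


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# Hypothetical values standing in for predictions.txt (documents x topics).
predictions = [[0.5, 1.5], [3.0, 1.0]]
# Hypothetical values standing in for readable.model.txt without its
# first column (terms x topics).
term_weights = [[1.0, 2.0], [3.0, 2.0], [4.0, 4.0]]

# theta: one row per document, each row sums to 1.
theta = normalise_rows(predictions)
# phi: stored here as one row per topic, each row sums to 1.
phi = normalise_rows(transpose(term_weights))
```

Here phi is kept with one row per topic; transpose it again if the tool you pass it to expects terms by topics.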

I understand that vw is not for the faint-hearted and that some programming/scripting will be required on my part, but I'm sure there must be some way to derive theta and phi from some of the output of vw. I've been stuck on this problem for days now, please give me some hints.

I don't know how to use pyLDAvis directly with Vowpal Wabbit. However, as you are already using a Python tool, you could use the Gensim wrapper and pyLDAvis together.

A Python wrapper for Vowpal Wabbit was offered in gensim (< 4.0.0). After converting the model with vwmodel2ldamodel, you can simply use Gensim as if you had trained the model with Gensim itself.
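
A rough sketch of that route is shown below. It assumes gensim < 4.0.0, a pyLDAvis release that still ships the `pyLDAvis.gensim` module, and a locally installed vw binary. The function name `vw_lda_to_pyldavis` and its parameters are hypothetical, and the exact import paths may differ between versions.

```python
def vw_lda_to_pyldavis(vw_path, corpus, dictionary, num_topics):
    """Train LDA through the gensim vw wrapper and prepare a pyLDAvis view.

    vw_path: path to the vw executable (assumed to be installed).
    corpus: gensim bag-of-words corpus; dictionary: gensim Dictionary.
    """
    # Imports are local so the sketch can be read without gensim installed.
    from gensim.models.wrappers import LdaVowpalWabbit
    from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
    import pyLDAvis.gensim

    # Fit the model by calling out to the vw binary.
    vw_model = LdaVowpalWabbit(
        vw_path, corpus=corpus, id2word=dictionary, num_topics=num_topics
    )
    # Convert to a regular gensim LdaModel.
    lda_model = vwmodel2ldamodel(vw_model)
    # From here on the model is treated like any other gensim LDA model.
    return pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
```

The returned object can then be shown with pyLDAvis, for example through `pyLDAvis.display` in a notebook.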

This workaround might be the easiest way if you are not familiar with the internals of Vowpal Wabbit (and LDA in general).
