SageMaker LDA topic model: how do I access the parameters of the trained model? Also, is there a simple way to capture coherence?

Question

I am new to SageMaker and am running some tests to measure the performance of NTM and LDA on AWS compared with LDA Mallet and the native Gensim LDA model.

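I want to inspect the trained models on SageMaker and look at things such as which words contribute most to each topic. I would also like to measure model coherence.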

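For NTM on SageMaker, I was able to get the words that contribute most to each topic by downloading the output file, untarring it, and unzipping it to expose three files: params, symbol.json, and meta.json.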

However, when I try the same process for LDA, the untarred output file cannot be unzipped.

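Maybe I am missing something, or I should be doing something different for LDA than for NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?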

Any help would be greatly appreciated!

Answer 1

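This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts. Specifically, it shows how to obtain the estimates of the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here: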

import os
import tarfile

import mxnet as mx

# FILENAME_PREFIX: directory where the tarball is located (with trailing separator)
tarfile_fname = FILENAME_PREFIX + 'model.tar.gz'

# extract the tarball into that same directory
with tarfile.open(tarfile_fname) as tar:
    tar.extractall(path=FILENAME_PREFIX)

# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir(FILENAME_PREFIX)
    if fname.startswith('model_')
]
model_fname = os.path.join(FILENAME_PREFIX, model_list[0])

# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(model_fname)

This should give you the model data. Note that the topics, which are stored as the rows of beta, are not presented in any particular order.
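To get from beta to the words that contribute most to each topic, one option is to sort each row and map the top column indices back to words. The following is a minimal sketch, not code from the notebook: the helper top_words_per_topic and the toy_vocab list are hypothetical, and it assumes you kept the vocabulary (in column order) that was used to build the training data. With a real model you would pass beta.asnumpy() in place of toy_beta.

```python
import numpy as np

def top_words_per_topic(beta, vocab, k=3):
    """Return the k highest-weight words for each topic (row) of beta."""
    beta = np.asarray(beta)
    # argsort is ascending: keep the last k columns, then reverse to descending
    top_idx = np.argsort(beta, axis=1)[:, -k:][:, ::-1]
    return [[vocab[j] for j in row] for row in top_idx]

# Toy stand-in for beta.asnumpy(): 2 topics over a 5-word vocabulary (made-up values)
toy_beta = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30, 0.50],
])
toy_vocab = ["cat", "dog", "fish", "stock", "bond"]  # hypothetical vocabulary

for topic_id, words in enumerate(top_words_per_topic(toy_beta, toy_vocab, k=2)):
    print(topic_id, words)
```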

Answer 2

Regarding coherence, there is no default implementation in SageMaker, AFAIK.

You can implement your own metric like this:

from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

def calculate_coherence(topic_vectors):
    similarity_sum = 0.0
    num_combinations = 0
    for pair in combinations(topic_vectors, 2):
        # cosine_similarity returns a 1x1 array; take the scalar entry
        similarity = cosine_similarity([pair[0]], [pair[1]])[0, 0]
        similarity_sum = similarity_sum + similarity
        num_combinations = num_combinations + 1
    return float(similarity_sum / num_combinations)

And get the coherence of your real model like this:

print(calculate_coherence(beta.asnumpy()))

Some intuitive sanity checks for this coherence function:

predictions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]

assert abs(calculate_coherence(predictions) - 0.0) < 1e-9, "Expected incoherent"

predictions = [[0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]]

# compare with a tolerance: the normalized dot product may not be exactly 1.0
assert abs(calculate_coherence(predictions) - 1.0) < 1e-9, "Expected coherent"

predictions = [[0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]]
# 3 of the 15 pairs are identical, so the average is 3/15 = 0.2 up to rounding
assert abs(calculate_coherence(predictions) - 0.2) < 1e-9, "Expected partially coherent"

Further reading:

http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
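The function above averages cosine similarity between topic vectors. Another family of coherence measures instead looks at how often a topic's top words appear together in documents, for example the UMass score. The following is a rough sketch of a UMass-style score for a single topic, not a SageMaker feature: the function name umass_coherence and the doc_term_counts matrix are assumptions for illustration, and it assumes every top word occurs in at least one document.

```python
import numpy as np

def umass_coherence(top_word_ids, doc_term_counts):
    """UMass-style coherence for one topic.

    top_word_ids: word indices ordered from most to least probable.
    doc_term_counts: (num_docs, vocab_size) array of word counts.
    """
    present = np.asarray(doc_term_counts) > 0  # does document d contain word w?
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            # number of documents containing both words
            co_doc_freq = np.sum(present[:, w_m] & present[:, w_l])
            # number of documents containing the higher-ranked word (assumed > 0)
            doc_freq = np.sum(present[:, w_l])
            score += np.log((co_doc_freq + 1.0) / doc_freq)
    return float(score)

# Toy corpus (made-up): words 0 and 1 co-occur, words 2 and 3 co-occur
toy_docs = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

print(umass_coherence([0, 1], toy_docs))  # top words that share documents
print(umass_coherence([0, 2], toy_docs))  # top words that never co-occur
```

Higher (less negative) values indicate top words that tend to share documents. For a full model, you would compute this per topic from the top word indices of each row of beta and average the results.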
