SageMaker LDA topic model - how to access the params of the trained model? Also, is there a simple way to capture coherence?

I'm new to SageMaker and am running some tests to measure the performance of NTM and LDA on AWS compared with Mallet LDA and the native Gensim LDA model.

I want to inspect the trained models on SageMaker and look at things like which words have the highest contribution for each topic. I'd also like to get a measure of model coherence.

For NTM on SageMaker, I have been able to get the words with the highest contribution for each topic by downloading the output file, untarring it, and unzipping the result to expose 3 files: params, symbol.json and meta.json.

However, when I try the same process for LDA, the untarred output file cannot be unzipped.

Maybe I'm missing something, or LDA needs a different approach from NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?

Any assistance would be greatly appreciated!

This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts, specifically how to obtain the estimates for the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:

import os
import tarfile
import mxnet as mx

# extract the tarball into the directory that contains it
tarfile_fname = FILENAME_PREFIX + 'model.tar.gz' # wherever the tarball is located
extract_dir = os.path.dirname(tarfile_fname) or '.'
with tarfile.open(tarfile_fname) as tar:
    tar.extractall(path=extract_dir)

# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir(extract_dir)
    if fname.startswith('model_')
]
model_fname = os.path.join(extract_dir, model_list[0])

# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(model_fname)

That should get you the model data. The extracted model_ file is loaded directly with mx.ndarray.load; there is no further unzip step. Note that the topics, which are stored as rows of beta, are not presented in any particular order.
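To go from beta to the highest-weighted words per topic, one option is to sort each row and map the column indices back to the vocabulary. This is only a minimal sketch: it assumes beta has been converted to a NumPy array of shape (num_topics, vocab_size) and that `vocab` is a list of words in the same column order used to build the training data. The helper name `top_words_per_topic` and the toy values are hypothetical.

```python
import numpy as np

def top_words_per_topic(beta, vocab, k=10):
    """Return the k highest-weighted words for each topic (row of beta)."""
    beta = np.asarray(beta)
    # argsort is ascending, so reverse each row and keep the first k columns
    top_idx = np.argsort(beta, axis=1)[:, ::-1][:, :k]
    return [[vocab[j] for j in row] for row in top_idx]

# toy example: 2 topics over a 4-word vocabulary (made-up numbers)
toy_beta = np.array([[0.1, 0.6, 0.2, 0.1],
                     [0.5, 0.05, 0.05, 0.4]])
toy_vocab = ["apple", "bank", "loan", "fruit"]
print(top_words_per_topic(toy_beta, toy_vocab, k=2))
```

With the trained model, the call would look like top_words_per_topic(beta.asnumpy(), vocab), where vocab is whatever word list you used when vectorizing the corpus.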

Regarding coherence, there's no default implementation in SageMaker, AFAIK.

You can implement your own metric, for example the average pairwise cosine similarity between topic vectors:

from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

def calculate_coherence(topic_vectors):
    # average cosine similarity over all pairs of topic vectors
    similarity_sum = 0.0
    num_combinations = 0
    for pair in combinations(topic_vectors, 2):
        # cosine_similarity returns a 1x1 matrix; take its single entry
        similarity = cosine_similarity([pair[0]], [pair[1]])[0, 0]
        similarity_sum = similarity_sum + similarity
        num_combinations = num_combinations + 1
    return float(similarity_sum / num_combinations)

and compute it for your trained model like this:

print(calculate_coherence(beta.asnumpy()))

Some intuitive sanity checks for the metric:

predictions = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]

assert abs(calculate_coherence(predictions) - 0.0) < 1e-9, "Expected incoherent"

predictions = [[0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0],
               [0.0, 1.0, 1.0]]

assert abs(calculate_coherence(predictions) - 1.0) < 1e-9, "Expected coherent"

predictions = [[0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0]]
assert abs(calculate_coherence(predictions) - 0.2) < 1e-9, "Expected partially coherent"
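The metric above compares topic vectors with each other. Coherence scores commonly used for topic models instead look at how often a topic's top words co-occur in the same documents. Below is a rough sketch of one such UMass-style score, not something SageMaker provides. It assumes you have a document-term count matrix for your corpus and the column indices of a topic's top words, ordered from highest to lowest weight; the function name `umass_coherence` and the toy corpus are hypothetical.

```python
import numpy as np

def umass_coherence(top_word_ids, doc_term):
    """UMass-style coherence for one topic.

    top_word_ids: column indices of the topic's top words, highest weight first.
    doc_term: (num_docs, vocab_size) array of word counts per document.
    """
    present = np.asarray(doc_term) > 0  # which words occur in which documents
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            doc_freq = present[:, w_l].sum()                     # docs containing w_l
            co_freq = (present[:, w_m] & present[:, w_l]).sum()  # docs containing both
            if doc_freq > 0:  # skip words that never occur in the corpus
                score += np.log((co_freq + 1) / doc_freq)
    return float(score)

# toy corpus: words 0 and 1 always appear together, word 2 appears alone
toy_docs = np.array([[1, 1, 0],
                     [2, 1, 0],
                     [0, 0, 3]])
print(umass_coherence([0, 1], toy_docs))  # words that co-occur
print(umass_coherence([0, 2], toy_docs))  # words that never co-occur
```

Higher (less negative) values mean the top words tend to appear together. Averaging the score over all topics gives a single number you can compare across models trained on the same corpus.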
