
With sparse data, what is the faster way to train LDA (Latent Dirichlet Allocation) and predict for a new document?

  1. About training LDA:

    When we implement LDA, we need to construct a words-documents matrix, but this matrix is sparse: our token dictionary is very large (several million words), while a single document contains only a small set of tokens (~1000-10000 words). As a result the words-documents matrix is mostly zeros, and training the model takes a long time. So how can we make this faster?

  2. About predicting a new document:

    After training we have an LDA model, so we can use it to predict the topics of a new document. But before feeding a new document to our model, we need to convert it to a word vector whose length equals the dictionary size (several million words). So it will again contain many zero values, and in practice the prediction time grows with the vector length.

So is a documents-words matrix an effective way to implement LDA? Is there a better way? I need some recommendations for my project, so please help.
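For context, here is a minimal sketch of the dense conversion the question describes; `dictionary` and `document` are tiny hypothetical stand-ins for a real vocabulary and corpus:

```python
# Hypothetical toy vocabulary; a real one would have millions of entries.
dictionary = {"apple": 0, "banana": 1, "cherry": 2, "durian": 3}

def to_dense_bow(tokens, dictionary):
    """Build a dense bag-of-words vector of length len(dictionary)."""
    vec = [0] * len(dictionary)  # almost entirely zeros when the dictionary is huge
    for t in tokens:
        if t in dictionary:
            vec[dictionary[t]] += 1
    return vec

document = ["apple", "banana", "apple"]
print(to_dense_bow(document, dictionary))  # [2, 1, 0, 0]
```

With a dictionary of millions of words, every document vector built this way costs memory and time proportional to the dictionary size, which is the problem described above.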

With sparse data, you should of course use sparse vectors instead of dense vectors.

Instead of storing all the zeros, you only keep the non-zero values.

A typical data model (see the literature for alternatives) is simply a list of tuples (i, v), where i is the column index and v is the non-zero value.
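As a sketch of that data model (all names here are illustrative), a mostly-zero vector collapses to a short list of (index, value) pairs, and operations such as a dot product then only touch the non-zero entries:

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (column_index, value) tuples."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(a, b):
    """Dot product of two sparse vectors; the cost depends on the number
    of non-zeros, not on the (possibly huge) dictionary length."""
    b_lookup = dict(b)
    return sum(v * b_lookup.get(i, 0) for i, v in a)

doc = to_sparse([2, 1, 0, 0, 0, 0, 0, 0])    # [(0, 2), (1, 1)]
other = to_sparse([0, 1, 0, 0, 0, 0, 0, 3])  # [(1, 1), (7, 3)]
print(sparse_dot(doc, other))                # 1
```

In practice, topic-modeling libraries such as gensim use exactly this corpus format: `Dictionary.doc2bow` returns a document as a list of `(token_id, count)` tuples, so neither training nor inference ever materializes a dense dictionary-length vector.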

