
With sparse data, what is the faster way to train LDA (Latent Dirichlet Allocation) and predict for a new document?

  1. About training LDA:

    When we implement LDA, we need to construct a words-documents matrix, but this matrix is sparse: our token dictionary is very large (several million words), while a single document contains only a small set of tokens (~1000-10000 words). As a result the words-documents matrix is mostly zeros, and training the model takes a long time. So how can we make this faster?

  2. About predicting a new document:

    After training we have an LDA model, so we can use it to predict the topics of a new document. But before feeding a new document to our model, we need to convert it to a word vector whose length equals the dictionary size (several million words). So it will again contain many zero values, and in practice the prediction time grows with the vector length.

So is a documents-words matrix an effective way to implement LDA? Is there a better way? I need some recommendations for my project, so please help.
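For context, here is a minimal sketch of the dense conversion the question describes; `dictionary` and `document` are tiny hypothetical stand-ins for a real vocabulary and corpus:

```python
# Hypothetical toy vocabulary; a real one would have millions of entries.
dictionary = {"apple": 0, "banana": 1, "cherry": 2, "durian": 3}

def to_dense_bow(tokens, dictionary):
    """Build a dense bag-of-words vector of length len(dictionary)."""
    vec = [0] * len(dictionary)  # almost entirely zeros when the dictionary is huge
    for t in tokens:
        if t in dictionary:
            vec[dictionary[t]] += 1
    return vec

document = ["apple", "banana", "apple"]
print(to_dense_bow(document, dictionary))  # [2, 1, 0, 0]
```

With a dictionary of millions of words, every document vector built this way costs memory and time proportional to the dictionary size, which is the problem described above.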

With sparse data, you should of course use sparse vectors instead of dense vectors.

Instead of storing all the zeros, you only keep the non-zero values.

A typical data model (see the literature for alternatives) is simply a list of tuples (i, v), where i is the column index and v is the non-zero value.
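As a sketch of that data model (all names here are illustrative), a mostly-zero vector collapses to a short list of (index, value) pairs, and operations such as a dot product then only touch the non-zero entries:

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (column_index, value) tuples."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(a, b):
    """Dot product of two sparse vectors; the cost depends on the number
    of non-zeros, not on the (possibly huge) dictionary length."""
    b_lookup = dict(b)
    return sum(v * b_lookup.get(i, 0) for i, v in a)

doc = to_sparse([2, 1, 0, 0, 0, 0, 0, 0])    # [(0, 2), (1, 1)]
other = to_sparse([0, 1, 0, 0, 0, 0, 0, 3])  # [(1, 1), (7, 3)]
print(sparse_dot(doc, other))                # 1
```

In practice, topic-modeling libraries such as gensim use exactly this corpus format: `Dictionary.doc2bow` returns a document as a list of `(token_id, count)` tuples, so neither training nor inference ever materializes a dense dictionary-length vector.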

