
Spark Latent Dirichlet Allocation model topic matrix is too small

First, just in case, I will explain how I represent the documents that I want to run the LDA model on. I do some preprocessing to extract the most important terms per person across all of their documents, and then take the union of all of those terms.

val text = groupedByPerson.map(s => (s._1, preprocessing.run(s, numWords, stopWords)))
val unionText = text.flatMap(s => s._2.map(l => l._2)).toSet

I 'tokenize' all the words in all the documents with a regular expression:

val df: DataFrame = ...
val regexpr = """[a-zA-Z]+""".r
val shaveText = df.select("text").map(row => regexpr.findAllIn(row.getString(0)).toSet)
val unionTextZip = unionText.zipWithIndex.toMap

I also noticed that, as in the example given in the documentation, I need to convert the string 'words' into unique doubles before running the LDA model, so I created a map to convert all the words.

val numbersText = shaveText.map(set => set.map(s => unionTextZip(s).toDouble))

Then I create the corpus:

val corpus = numbersText.zipWithIndex.map(s => (s._2, Vectors.dense(s._1.toArray))).cache

Now I run the LDA model:

 val ldaModel = new LDA().setK(3).run(corpus)

When I check the vocabulary size, I notice that it is set to the size of the first document in the corpus, even though other documents have larger or smaller vocabularies.

The topic matrix therefore fails with an error that looks something like this:

Exception in thread "main" java.lang.IndexOutOfBoundsException: (200,0) not in [-31,31) x [-3,3)
    at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
    at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:544)
    at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:541)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:541)
    at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:533)
    at application.main.Main$.main(Main.scala:110)
    at application.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

I thought I could just use the vector to represent a bag of words. Do the vectors need to be of equal size? That is, should I create a boolean feature for each word, indicating whether or not it appears in the document?

Long story short, the vectors do of course all need to be the same length. The obvious answer is to use a sparse vector. I used this and its GitHub link for guidance.
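For reference, a minimal sketch of what that might look like with the RDD-based MLlib API, reusing the unionTextZip word-to-index map and the shaveText tokenized documents from above (both assumed to have the types shown earlier). Every document vector has length vocabSize, so LDA sees one consistent vocabulary:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// All document vectors must share one length: the size of the vocabulary.
val vocabSize = unionTextZip.size

// One sparse term-count vector per document. Because shaveText holds a Set of
// words per document, every count is 1.0 (presence/absence); keep the full
// token list instead of a Set if real term frequencies are wanted.
val corpus = shaveText.zipWithIndex.map { case (words, docId) =>
  val termCounts = words.toSeq
    .filter(unionTextZip.contains)          // drop words outside the vocabulary
    .map(w => (unionTextZip(w), 1.0))       // vocabulary index -> count
  (docId, Vectors.sparse(vocabSize, termCounts))
}.cache()

val ldaModel = new LDA().setK(3).run(corpus)

With this representation the vocabulary size reported by the model matches unionTextZip.size rather than the length of the first document's vector, and topicsMatrix no longer goes out of bounds.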
