
Apache Spark MLLib - Running KMeans with TF-IDF vectors - Java heap space

I'm trying to run KMeans from MLLib on a (large) collection of text documents (TF-IDF vectors). Documents are sent through a Lucene English analyzer, and sparse vectors are created from the HashingTF.transform() function. Whatever degree of parallelism I use (through the coalesce function), KMeans.train always throws the OutOfMemory exception below. Any thoughts on how to tackle this issue? (A minimal sketch of the pipeline follows the stack trace.)

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
at scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
at breeze.linalg.Vector$class.toArray(Vector.scala:80)
at breeze.linalg.SparseVector.toArray(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector(Vector.scala:75)
at breeze.linalg.SparseVector.toDenseVector(SparseVector.scala:48)
at breeze.linalg.Vector$class.toDenseVector$mcD$sp(Vector.scala:74)
at breeze.linalg.SparseVector.toDenseVector$mcD$sp(SparseVector.scala:48)
at org.apache.spark.mllib.clustering.BreezeVectorWithNorm.toDense(KMeans.scala:422)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:285)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$1.apply(KMeans.scala:284)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:284)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
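
For reference, a minimal sketch of the pipeline described above. The input path and the parameters (k = 1000, 20 iterations, 8 partitions) are placeholders, and tokenize merely stands in for the Lucene English analyzer:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    object TfIdfKMeans {
      // Stand-in for the Lucene English analyzer step
      def tokenize(line: String): Seq[String] =
        line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TfIdfKMeans"))

        val tokens = sc.textFile("hdfs:///path/to/docs").map(tokenize)

        // HashingTF defaults to 2^20 features; vectors are sparse at this point
        val tf = new HashingTF().transform(tokens).cache()
        val tfidf = new IDF().fit(tf).transform(tf)

        // Throws java.lang.OutOfMemoryError regardless of the parallelism
        val model = KMeans.train(tfidf.coalesce(8), 1000, 20)
      }
    }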

After some investigation, it turns out that this issue was related to the new HashingTF().transform(v) method. Although creating sparse vectors with the hashing trick is really helpful (especially when the number of features is not known), vectors must be kept sparse. The default size for HashingTF vectors is 2^20. Given 64-bit double precision, each vector would theoretically require 8 MB when converted to a dense vector, regardless of the dimensionality reduction we could apply.
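
One workaround, assuming a smaller feature space is acceptable for the corpus, is to pass an explicit size to the HashingTF constructor so that any densified vector stays small. The 2^18 value below is only an example:

    import org.apache.spark.mllib.feature.HashingTF

    // 2^18 features * 8 bytes = 2 MB per dense vector, versus 8 MB at the
    // default 2^20; the trade-off is a higher chance of hash collisions.
    val tf = new HashingTF(1 << 18).transform(tokens)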

Sadly, KMeans uses the toDense method (at least for the cluster centers), which causes the OutOfMemory error (imagine k = 1000: a thousand dense centers at 8 MB each is already about 8 GB of heap, per run):

  private def initRandom(data: RDD[BreezeVectorWithNorm]) : Array[Array[BreezeVectorWithNorm]] = {
    val sample = data.takeSample(true, runs * k, new XORShiftRandom().nextInt()).toSeq
    Array.tabulate(runs)(r => sample.slice(r * k, (r + 1) * k).map { v =>
      // toDenseVector materializes all 2^20 components of each sampled center
      new BreezeVectorWithNorm(v.vector.toDenseVector, v.norm)
    }.toArray)
  }
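
A quick back-of-the-envelope check of what that densification costs, using plain arithmetic on the numbers above:

    // Heap needed just for the densified centers, counting doubles only
    // (no JVM object overhead)
    val numFeatures = 1L << 20            // HashingTF default
    val k = 1000
    val runs = 1
    val bytes = numFeatures * 8 * k * runs
    println(f"${bytes / (1L << 30).toDouble}%.1f GB for the centers alone")  // ~7.8 GB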
