
Java heap space in Spark MLlib

I have the following code, which computes some metrics by cross-validation for a random forest classification.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import scala.collection.parallel.mutable.ParArray

def run(data: RDD[LabeledPoint], metric: String = "PR") = {

    val cv_data:Array[(RDD[LabeledPoint], RDD[LabeledPoint])] = MLUtils.kFold(data, numFolds, 0)

    val result : Array[(Double, Double)] = cv_data.par.map{case (training, validation) =>
      training.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
      validation.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

      val res: ParArray[(Double, Double)] = CV_params.par.zipWithIndex.map { case (p, i) =>
        // Train a random forest on the training fold with the i-th parameter combination
        val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo,
          p(0).asInstanceOf[Int], p(3).asInstanceOf[String], p(4).asInstanceOf[String],
          p(1).asInstanceOf[Int], p(2).asInstanceOf[Int])
        // Predict on the validation fold: predictWithLabels pairs each prediction with its label
        val labelAndPreds: RDD[(Double, Double)] = model.predictWithLabels(validation)
        // Metrics, divided by numFolds so the outer reduce yields cross-fold averages
        val bcm = new BinaryClassificationMetrics(labelAndPreds)
        (bcm.areaUnderROC() / numFolds, bcm.areaUnderPR() / numFolds)
      }

      training.unpersist()
      validation.unpersist()
      res
    }.reduce((s1,s2) => s1.zip(s2).map(t => (t._1._1 + t._2._1, t._1._2 + t._2._2))).toArray

    val cv_roc = result.map(_._1)
    val cv_pr = result.map(_._2)

    // Extract best params
    val which_max = (metric match {
      case "ROC" => cv_roc
      case "PR" => cv_pr
      case _ =>
        logWarning("Metrics set to default one: PR")
        cv_pr
    }).zipWithIndex.maxBy(_._1)._2

    best_values_array = CV_params(which_max)
    CV_areaUnderROC = cv_roc
    CV_areaUnderPR = cv_pr
  }
}

val numTrees = Array(50)
val maxDepth = Array(30)
val maxBins = Array(100)
val featureSubsetStrategy = Array("sqrt")
val impurity = Array("gini")

val CV_params: Array[Array[Any]] = {
    for (a <- numTrees; b <- maxDepth; c <- maxBins; d <- featureSubsetStrategy;
         e <- impurity) yield Array(a, b, c, d, e)
}

run(data, "PR")

It runs on a YARN cluster with 50 containers (26 GB of memory in total). The data parameter is an RDD[LabeledPoint]. I use Kryo serialization and a default level of parallelism of 1000.
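For reference, the relevant configuration is set up roughly along these lines (a sketch; the application name is a placeholder rather than my actual code):

    import org.apache.spark.{SparkConf, SparkContext}

    // Driver-side setup sketch: Kryo serialization and a default
    // parallelism of 1000, as described above.
    val conf = new SparkConf()
      .setAppName("RandomForestCV") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.default.parallelism", "1000")
    val sc = new SparkContext(conf)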

For a small dataset it works, but for my real data of 600 000 points I get the following error:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1841)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1533)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)

I can't figure out where the error comes from, because the total allocated memory (26 GB) is much higher than what is actually consumed during the job (I checked in the Spark web UI).

Any help would be appreciated. Thank you!

Just a guess, but one unusual thing you are doing is submitting many jobs in parallel with your call to .par. Note that Spark normally achieves parallelism a different way: you submit one job, and that job is broken into a number of tasks which can run in parallel.

There is nothing wrong in principle with what you are doing; it can be useful when the parallelization within a single job is small, since in that case you would not make effective use of the cluster by submitting one job at a time. On the other hand, just using .par may result in too many jobs being submitted in parallel. That convenience method will keep submitting jobs to try to keep the driver busy (to a first approximation, anyway); but in fact, in Spark it is not unusual for the driver to be relatively idle while it waits for the cluster to do the heavy lifting. So while the driver may have plenty of CPU available, it may be using a lot of memory just from the book-keeping required to prepare 1000 jobs simultaneously (I am not sure how many jobs you are actually generating).

If you do want to submit jobs in parallel, it may help to limit it to a small number, e.g. only 2 or 4 jobs at a time, as sketched below.
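Here is a minimal sketch of one way to do that, by giving the parallel collection an explicit task support (this assumes Scala 2.10/2.11, where ForkJoinTaskSupport takes a scala.concurrent.forkjoin.ForkJoinPool; on Scala 2.12+ the pool comes from java.util.concurrent instead, and foldsPar is just an illustrative name):

    import scala.collection.parallel.ForkJoinTaskSupport
    import scala.concurrent.forkjoin.ForkJoinPool

    // Cap the parallel collection at 2 concurrent tasks, so at most
    // 2 Spark jobs are in flight from the driver at any one time.
    val foldsPar = cv_data.par
    foldsPar.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))

    // Then map over foldsPar exactly as before:
    // val result = foldsPar.map { case (training, validation) => ... }

You can do the same for the inner CV_params.par if you keep that parallel as well.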
