
Reducing run time of Apache Spark job/application

We are trying to implement a simple Spark job which reads a CSV file (1 row of data) and makes a prediction using a prebuilt random forest model object. This job doesn't include any data preprocessing or data manipulation.

We are running Spark in standalone mode with the application running locally. The configuration is as follows:

  • RAM: 8GB
  • Memory: 40GB
  • No. of cores: 2
  • Spark version: 1.5.2
  • Scala version: 2.10.5
  • Input file size: 1KB (1 row of data)
  • Model file size: 1,595 KB (400-tree random forest)

Currently, the implementation via spark-submit takes about 13 seconds. However, the run time is a huge concern for this application, hence:

  1. Is there a way to optimize the code to bring the run time down to 1 or 2 seconds? (high priority)

  2. We noticed that the actual code execution takes about 7-8 seconds, while booting up and setting up contexts takes about 5-6 seconds, so is there a way to keep the Spark context running between spark-submit runs?

Here is the application code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object RF_model_App {
  def main(args: Array[String]) {

val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._
val Test = sqlContext.read.format("com.databricks.spark.csv").option("header","true").load("/home/ubuntu/Test.csv")
Test.registerTempTable("Test")
val model_L1 = sc.objectFile[RandomForestClassificationModel]("/home/ubuntu/RF_L1.model").first()

val toInt = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val featureDf = Test
  .withColumn("id1", toInt(Test("id1")))
  .withColumn("id2", toInt(Test("id2")))
  .withColumn("id3", toInt(Test("id3")))
  .withColumn("id4", toInt(Test("id4")))
  .withColumn("feature3", toInt(Test("feature3")))
  .withColumn("feature9", toInt(Test("feature9")))
  .withColumn("feature10", toInt(Test("feature10")))
  .withColumn("feature12", toInt(Test("feature12")))
  .withColumn("feature14", toDouble(Test("feature14")))
  .withColumn("feature15", toDouble(Test("feature15")))
  .withColumn("feature16", toInt(Test("feature16")))
  .withColumn("feature17", toDouble(Test("feature17")))
  .withColumn("feature18", toInt(Test("feature18")))

val feature4_index = new StringIndexer()  .setInputCol("feature4")  .setOutputCol("feature4_index")
val feature6_index = new StringIndexer()  .setInputCol("feature6")  .setOutputCol("feature6_index")
val feature11_index = new StringIndexer()  .setInputCol("feature11")  .setOutputCol("feature11_index")
val feature8_index = new StringIndexer()  .setInputCol("feature8")  .setOutputCol("feature8_index")
val feature13_index = new StringIndexer()  .setInputCol("feature13")  .setOutputCol("feature13_index")
val feature2_index = new StringIndexer()  .setInputCol("feature2")  .setOutputCol("feature2_index")
val feature5_index = new StringIndexer()  .setInputCol("feature5")  .setOutputCol("feature5_index")
val feature7_index = new StringIndexer()  .setInputCol("feature7")  .setOutputCol("feature7_index")
val vectorizer_L1 =  new VectorAssembler()  .setInputCols(Array("feature3",  "feature2_index", "feature6_index", "feature4_index", "feature8_index", "feature7_index", "feature5_index", "feature10", "feature9", "feature12", "feature11_index", "feature13_index", "feature14", "feature15", "feature18", "feature17", "feature16")).setOutputCol("features_L1")
val feature_pipeline_L1 = new Pipeline()  .setStages(Array( feature4_index, feature6_index, feature11_index,feature8_index, feature13_index,  feature2_index, feature5_index, feature7_index,vectorizer_L1))
val testPredict= feature_pipeline_L1.fit(featureDf).transform(featureDf)
val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
val getid2 = udf((v: Int) => v)
val L1_output = model_L1.transform(testPredict).select(getid2($"id2") as "id2",getid2($"prediction") as "L1_prediction",getPOne($"probability") as "probability")

L1_output.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save("/home/L1_output")

  }
};

Let's start with the things which are simply wrong:

  • The feature mechanism you use is simply incorrect. StringIndexer assigns indices based on the distribution of the data, so the same record will get a different encoding depending on the other records. You should use the same StringIndexerModel (-s) for training, testing and predictions; see the sketch after this list.
  • val getid2 = udf((v: Int) => v) is just an expensive identity.
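
For illustration, here is a minimal sketch of how both points could look in the question's Spark 1.5 / Scala setup: fit the feature pipeline once on the training data (trainDf is a placeholder for whatever DataFrame the model was trained on), persist the fitted PipelineModel with the same objectFile mechanism the question already uses for the random forest, and at prediction time only load and transform. The paths are hypothetical.

import org.apache.spark.ml.PipelineModel

// At training time: fit the StringIndexers (and the rest of the feature pipeline)
// once, then persist the fitted model. Spark 1.5 has no PipelineModel.save, so this
// reuses the objectFile approach already used for RF_L1.model (it assumes the fitted
// pipeline is Java-serializable, like the saved random forest model).
val fittedFeatures: PipelineModel = feature_pipeline_L1.fit(trainDf)
sc.parallelize(Seq(fittedFeatures), 1).saveAsObjectFile("/home/ubuntu/feature_pipeline_L1.model")

// At prediction time: load the fitted pipeline and only call transform, so the
// encodings are the ones learned from the training data.
val featureModel = sc.objectFile[PipelineModel]("/home/ubuntu/feature_pipeline_L1.model").first()
val testPredict = featureModel.transform(featureDf)

// No identity UDF: select the columns directly.
val L1_output = model_L1.transform(testPredict)
  .select($"id2", $"prediction".as("L1_prediction"), getPOne($"probability").as("probability"))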

Persistent SparkContext

There are multiple tools which keep a persistent context, including job-server or Livy.

Finally, you can simply use Spark Streaming and just process the data as it comes; a rough sketch follows.
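
A rough sketch of the streaming variant under the same Spark 1.5 setup. The watched directory, the toRow parser and the testSchema used to turn incoming CSV lines into a DataFrame are assumptions, not part of the original code; the model and the fitted feature pipeline are loaded once, outside the loop.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Poll a directory for new CSV files every second and score each non-empty batch
// with the already-loaded model_L1 and featureModel.
val ssc = new StreamingContext(sc, Seconds(1))

ssc.textFileStream("/home/ubuntu/incoming").foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // toRow: hypothetical parser that splits a CSV line into the columns of Test.csv
    // (with the same toInt/toDouble conversions as the batch job); testSchema is the
    // matching StructType.
    val df = sqlContext.createDataFrame(rdd.map(toRow), testSchema)
    val scored = model_L1.transform(featureModel.transform(df))
    scored.write.format("com.databricks.spark.csv")
      .option("header", "true").mode("append").save("/home/L1_output_stream")
  }
}

ssc.start()
ssc.awaitTermination()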

Shuffling

You are also using repartition to create a single partition and thus, I suppose, a single CSV file. This action is quite expensive: by definition it reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them, and it always shuffles all data over the network.
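
For a single row of output this barely matters, but as a general aside (my suggestion, not something stated above): if a single output file is really needed, coalesce narrows the number of partitions without a full shuffle and is usually the cheaper choice.

// Still writes one part file, but coalesce avoids the full shuffle that
// repartition always performs.
L1_output.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true").mode("overwrite").save("/home/L1_output")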

Other considerations:

If latency is important and you only use a single, low-performance machine, don't use Spark at all. There is nothing to gain here. A good local library can do a much better job in a case like this.

Notes:

We don't have access to your data or your hardware, so any requirement like "reduce the time to 7s" is completely meaningless.
