
Spark-submit job performance

I am currently running spark-submit in the following environment:

Single node (RAM: 40GB, VCores: 8, Spark version: 2.0.2, Python: 3.5)

My PySpark program basically reads one 450MB unstructured file from HDFS, loops through each line to extract the necessary data and place it into lists, and finally uses createDataFrame to save the resulting data frame into a Hive table.

A snippet of my PySpark code:

sparkSession = (SparkSession
.builder
.master("yarn")
.appName("FileProcessing")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate())

lines = sparkSession.read.text('/user/test/testfiles').collect()

for line in lines:
    # perform some data extraction and place it into rowList and colList using normal Python operations

df = sparkSession.createDataFrame(rowList, colList)

df.registerTempTable("tempTable")
sparkSession.sql("create table test as select * from tempTable")
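
(For comparison, a minimal sketch of the same read-parse-save flow with the line parsing kept on the executors instead of collect()-ing every line to the driver; parse_line and the pipe delimiter are hypothetical placeholders for the real extraction logic.)

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("FileProcessing").enableHiveSupport().getOrCreate()

def parse_line(line):
    # Hypothetical extraction logic; replace with the real parsing rules.
    parts = line.split('|')
    return (parts[0], parts[1] if len(parts) > 1 else None)

# read.text yields one Row per line with a single 'value' column;
# mapping over the underlying RDD keeps the parsing on the executors.
rows = sparkSession.read.text('/user/test/testfiles').rdd.map(lambda r: parse_line(r.value))

df = sparkSession.createDataFrame(rows, ['colA', 'colB'])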

My spark-submit command is as the following:

spark-submit --master yarn --deploy-mode cluster --num-executors 2 --driver-memory 4g --executor-memory 8g --executor-cores 3 --files /usr/lib/spark-2.0.2-bin-hadoop2.7/conf/hive-site.xml FileProcessing.py

It took around 5 minutes to complete the processing. Is this performance considered good? How should I tune the executor memory and executor cores so that the process completes within 1-2 minutes? Is that possible?

Appreciate your response. Thanks.

To tune your application you need to know a few things:

1) Monitor your application to see whether the cluster is underutilized, and how many of the resources you requested your application actually uses.

Monitoring can be done with various tools, e.g. Ganglia; from Ganglia you can see CPU, memory, and network usage.
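
Besides Ganglia, the Spark driver's built-in UI exposes the same per-executor numbers over REST while the application is running. A minimal sketch, assuming the default UI port 4040 on the driver host:

import json
import urllib.request

base = "http://localhost:4040/api/v1"  # assumption: driver host, default UI port

def get_json(url):
    return json.loads(urllib.request.urlopen(url).read().decode("utf-8"))

# First application in the list is the one served by this driver UI.
app_id = get_json(base + "/applications")[0]["id"]

# Each executor record reports, among other fields, its memory and core usage.
for ex in get_json(base + "/applications/" + app_id + "/executors"):
    print(ex["id"], ex["memoryUsed"], ex["totalCores"])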

2) Based on those observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.

From Spark's point of view:

In spark-defaults.conf

you can specify which serializer to use and how much driver memory and executor memory your application needs; you can even change the garbage collection algorithm.

Below are a few examples; tune these parameters based on your requirements:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.memory            3g
spark.executor.extraJavaOptions  -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions    -XX:MaxPermSize=6G -XX:+UseG1GC
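
The same properties can also be set per application on the SparkSession builder instead of globally in spark-defaults.conf; a sketch with illustrative values (driver memory is the exception: it must be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf):

from pyspark.sql import SparkSession

# Per-application equivalents of the spark-defaults.conf entries above;
# the sizes here are illustrative, not recommendations.
spark = (SparkSession.builder
         .appName("TunedApp")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.executor.memory", "3g")
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .getOrCreate())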

For more details, refer to http://spark.apache.org/docs/latest/tuning.html

Hope this helps!
