
Spark SQL performance with Simple Scans

I am using Spark 1.4 on a standalone cluster of 3 machines for a workload similar to TPC-H (analytical queries with multiple multi-way large joins and aggregations). Each machine has 12 GB of memory and 4 cores. My total data size is 150 GB, stored in HDFS as Hive tables, and I am running my queries through Spark SQL using a HiveContext. After checking the performance tuning documents on the Spark website and some talks from the latest Spark Summit, I decided to set the following configs in my spark-env:

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M

(My tasks tend to be long, so the overhead of starting multiple JVMs, one per worker, is small compared to the total query time.) While monitoring job progress, I noticed that although the worker memory is 2.5 GB, the executors (one per worker) have a maximum memory of 512 MB (the default). I increased this value in my application as:

conf.set("spark.executor.memory", "2.5g");

The idea was to give each worker's maximum available memory to its only executor, but I observed that my queries run slower than in the previous case (the default 512 MB). Changing 2.5g to 1g improved the response time, bringing it close to, but still worse than, the 512 MB case. I guess what I am missing here is the relationship between SPARK_WORKER_MEMORY and spark.executor.memory.
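
For completeness, here is a minimal sketch of how the application sets up the context and runs a query (the app name and table name are placeholders, not my actual workload):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class ScanQuery {
    public static void main(String[] args) {
        // Executor memory set from the application, as described above.
        SparkConf conf = new SparkConf()
                .setAppName("tpch-like-query")          // placeholder app name
                .set("spark.executor.memory", "2.5g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(sc.sc());

        // "lineitem" stands in for one of the Hive tables stored in HDFS.
        DataFrame result = hiveContext.sql("SELECT COUNT(*) FROM lineitem");
        result.show();

        sc.stop();
    }
}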

  • Isn't it the case that the worker splits this memory among its executors (in my case, its only executor)? Or is there other work done by the worker that needs memory?

  • What other important parameters do I need to look into and tune at this point to get the best response time out of my hardware? (I have read about the Kryo serializer and am about to try it; I am mainly concerned about memory-related settings and knobs related to the parallelism of my jobs; see the sketch after this list.) As an example, for a simple scan-only query, Spark is almost 3 times slower than Hive, even though both scan the exact same table and file format. That is why I believe I am missing some parameters by leaving them at their defaults.
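
To make the second question concrete, these are the kinds of knobs I have in mind; all values below are illustrative guesses to show which settings exist, not settings I have verified:

import org.apache.spark.SparkConf;

public class TuningKnobs {
    // Illustrative tuning knobs for Spark 1.4; the numbers are guesses, not recommendations.
    static SparkConf tunedConf() {
        return new SparkConf()
                .setAppName("tpch-like-query")                                          // placeholder
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo instead of Java serialization
                .set("spark.executor.memory", "2500m")                                 // match SPARK_WORKER_MEMORY
                .set("spark.storage.memoryFraction", "0.4")                            // RDD cache share of executor heap (pre-1.6 memory model)
                .set("spark.shuffle.memoryFraction", "0.4")                            // shuffle share of executor heap
                .set("spark.sql.shuffle.partitions", "48")                             // partitions for Spark SQL joins/aggregations
                .set("spark.default.parallelism", "48");                               // partitions for non-SQL shuffles
    }
}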

Any hint/suggestion would be highly appreciated.

SPARK_WORKER_CORES is shared across the worker instances. Increase the cores to, say, 8, and you should see the kind of behavior (and performance) that you had anticipated.
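
In spark-env terms, the suggestion would look roughly like this (8 is the example value from the answer; the other settings are left as in the question):

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=2500M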
