
Spark SQL performance with simple scans

I am using Spark 1.4 on a cluster (standalone mode) across 3 machines, for a workload similar to TPC-H (analytical queries with multiple multi-way large joins and aggregations). Each machine has 12GB of memory and 4 cores. My total data size is 150GB, stored in HDFS as Hive tables, and I am running my queries through Spark SQL using a Hive context. After checking the performance tuning documents on the Spark page and some clips from the latest Spark Summit, I decided to set the following configs in my spark-env:

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2500M
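
For reference, this is roughly how the queries are issued through Spark SQL with a Hive context (a minimal sketch; the app name, table name, and query text are illustrative placeholders, not my actual workload):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf().setAppName("tpch-like-queries");
JavaSparkContext sc = new JavaSparkContext(conf);
// HiveContext picks up the table definitions from the Hive metastore,
// so the HDFS-backed Hive tables can be queried directly
HiveContext sqlContext = new HiveContext(sc.sc());
// Placeholder query; the real workload has multi-way joins and aggregations
DataFrame result = sqlContext.sql("SELECT * FROM lineitem LIMIT 10");
result.show();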

(As my tasks tend to be long, the overhead of starting multiple JVMs, one per worker, is much smaller than the total query times.) As I monitor the job progress, I realized that while the worker memory is 2.5GB, the executors (one per worker) have a max memory of 512MB (the default). I enlarged this value in my application as:

conf.set("spark.executor.memory", "2.5g");
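
For completeness, this is where that call sits in my application (a minimal sketch; the setting has to be on the SparkConf before the SparkContext is created):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("tpch-like-queries");
// Request 2.5GB per executor; a standalone worker can only launch executors
// whose memory request fits within its SPARK_WORKER_MEMORY
conf.set("spark.executor.memory", "2.5g");
JavaSparkContext sc = new JavaSparkContext(conf);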

Trying to give the max available memory on each worker to its only executor, I observed that my queries run slower than in the previous case (default 512MB). Changing 2.5g to 1g improved the time; it is close to, but still worse than, the 512MB case. I guess what I am missing here is the relationship between SPARK_WORKER_MEMORY and spark.executor.memory.

  • Isn't it the case that the worker tries to split this memory among its executors (in my case, its only executor)? Or is there other work being done by the worker that needs memory?

  • What other important parameters do I need to look into and tune at this point to get the best response time out of my hardware? (I have read about the Kryo serializer and am about to try it; I am mainly concerned with memory-related settings and also knobs related to the parallelism of my jobs; see the sketch below.) As an example, for a simple scan-only query Spark is worse than Hive (almost 3 times slower), while both are scanning the exact same table and file format. That is why I believe I am missing some parameters by leaving them at their defaults.
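
For reference, the knobs I am referring to would be set roughly like this (a minimal sketch; the property names are standard Spark 1.x settings, and the values are placeholders rather than tuned recommendations):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf();
// Kryo instead of the default Java serialization for shuffled/cached data
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Default number of partitions for RDD shuffle operations (placeholder value)
conf.set("spark.default.parallelism", "16");
// Number of partitions Spark SQL uses when shuffling data for joins/aggregations (placeholder value)
conf.set("spark.sql.shuffle.partitions", "16");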

Any hint/suggestion would be highly appreciated.

SPARK_WORKER_CORES is shared across the instances. Increase the cores to, say, 8, and then you should see the kind of behavior (and performance) that you had anticipated.
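
Taken literally, the suggested change to the spark-env from the question would look like the following (the other values are left as in the question):

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=2500M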


 