
Optimization of Spark batch job on EMR

We are running a Spark job on an EMR cluster with the cluster configuration given below.

Resources:

Node Type: CORE
2 instances of r4.8xlarge
32 vCore, 244 GiB memory, EBS only storage
EBS Storage: 32 GiB

Node Type: MASTER
1 instance of r4.4xlarge
16 vCore, 122 GiB memory, EBS only storage
EBS Storage: 32 GiB

Node Type: TASK
2 instances of r4.4xlarge
16 vCore, 122 GiB memory, EBS only storage
EBS Storage: 32 GiB

We are doing spark-submit with the following arguments on the EMR console:

/usr/bin/spark-submit --deploy-mode cluster --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true --conf spark.sql.files.ignoreCorruptFiles=true --driver-memory 5g --master yarn --class class_name s3://location_of_jar -c s3://location of input to jar -w xyz.json

We feel these arguments are not leveraging the full available resources. Can anyone suggest a more optimized way to do spark-submit on EMR, either by changing the spark-defaults.conf file or by passing more arguments, so that all available resources are utilized optimally? We run one job at a time; there are no parallel jobs running on the cluster.

Without knowing the resources allocated per executor, the nature of the job, the volume of data you're processing, etc., it's very hard to give a proper suggestion. I think the best you can do now is to also install Ganglia while creating the EMR cluster. The Ganglia web UI is available via http://master-public-dns-name/ganglia/

Look at the CPU and memory usage to start with. That will give you a good enough idea of whether you're optimally allocating resources for your Spark job, and then you can tune the resources per executor accordingly.

The number of executors, executor memory and cores can be set in your spark-submit command in the following way (these are sample values):

--num-executors 10
--executor-cores 1
--executor-memory 5g

After looking at the Ganglia charts you'll get a sense of which resource is being under- or over-utilized. Change those values accordingly.
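For illustration, a sketch of the question's command with those sample values plugged in might look like this (the class name, jar and input locations are the placeholders from the question, and the executor settings are only starting points to be tuned based on what Ganglia shows):

/usr/bin/spark-submit --deploy-mode cluster --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true --conf spark.sql.files.ignoreCorruptFiles=true --driver-memory 5g --master yarn --num-executors 10 --executor-cores 1 --executor-memory 5g --class class_name s3://location_of_jar -c s3://location of input to jar -w xyz.json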

If you don't want to play around with these numbers and would rather let Spark decide the best combination, it might be worth enabling dynamic resource allocation with the following lines:

--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.enabled=true
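
If you go the dynamic allocation route, you can optionally also bound how far it scales (sample values; these are standard Spark settings, not part of the original answer):

--conf spark.dynamicAllocation.minExecutors=2
--conf spark.dynamicAllocation.maxExecutors=20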

One thing to note here is that YARN will get about 75% of the total memory allocated to the core + task nodes. Also, the driver and each executor have a memory overhead associated with them; look up the Spark documentation for that. Keep this in mind while manually allocating resources to the driver and executors.
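As a rough illustration (assuming Spark's default overhead on YARN of max(10% of executor memory, 384 MiB)): with --executor-memory 5g the overhead comes to about 512 MiB, so each executor container actually requests roughly 5.5 GiB from YARN, and it is this larger figure that determines how many executors fit on a node.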

The first step to analyse a Spark job is the Spark UI, so use the tracking URL and look into the logs, jobs, executors and streaming tabs:

http://cluster_manager_host:8088/

For a more detailed analysis of memory and CPU utilisation, you can use the Ganglia tool too:

http://cluster_manager_host/ganglia/

After this, what you can do is:

  • You have to go for custom configurations, such as:

    (i) Number of executors: --num-executors x

    (ii) Executor memory: --executor-memory y

    (iii) Number of cores: --executor-cores z

    (iv) Enable dynamic resource allocation: --conf spark.dynamicAllocation.enabled=true

    (v) Enable maximum resource allocation: maximizeResourceAllocation=true (on EMR this is set in the spark classification when creating the cluster rather than passed as a --conf flag)

    (vi) Change the serialisation from the default to Kryo: --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

    (vii) Change the number of partitions from the default to a custom value based on your configuration, e.g. rdd = rdd.repartition(sc.defaultParallelism * 2)

  • If your job is still slow after properly configuring the above, change your code to use proper functions and objects. For example:

    (i) If you are sending data to an external destination like Kinesis, a DB or Kafka, use mapPartitions or foreachPartition and reduce the number of object creations (see the sketch after this list).

    (ii) If you are calling an external API, also follow the above strategy.

    (iii) Use proper data structures.
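
As an illustration of point (i), here is a minimal Scala sketch of the foreachPartition pattern, creating one client per partition instead of one per record. The Kafka producer, broker address, topic name and input path are assumptions for the example only; the same pattern applies to Kinesis, a database or any external API.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreachPartition-sketch").getOrCreate()
val rdd = spark.sparkContext.textFile("s3://your-bucket/input/")   // hypothetical input path

rdd.foreachPartition { records =>
  // One producer per partition, not one per record
  val props = new Properties()
  props.put("bootstrap.servers", "broker-host:9092")               // hypothetical brokers
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  records.foreach(record => producer.send(new ProducerRecord[String, String]("my-topic", record)))  // hypothetical topic

  producer.close()   // flush and release the connection once per partition
}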


I hope this will help you.
