
Spark job not getting Enough Containers on cluster

I have a Spark application which reads data from Oracle into DataFrames, then converts them to JavaRDDs and saves them as text files to HDFS. I am running this on YARN on an 8-node cluster. When I look at the job in the Spark web UI, I can see it is getting only 2 containers and 2 CPUs.

I am reading 5 tables from Oracle. Each table has around 500 million rows. The total data size is about 80GB.

spark-submit  --class "oracle.table.join.JoinRdbmsTables"  --master yarn --deploy-mode cluster  oracleData.jar

I also tried:

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 40 oracleDataWrite.jar

With this I could see 40 containers assigned to the job. However, I could only see 1 active task in the web UI.

I have another Spark application which loads a 20GB text file, does some processing on the data, and saves it to HDFS. I can see it gets assigned around 64 containers and CPUs.

spark-submit  --class "practice.FilterSave"  --master yarn --deploy-mode cluster  batch-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar mergedData.json

The difference between them is that for the second application I am using a JavaSparkContext, while for the first I am using a SQLContext in order to use DataFrames.
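
For reference, a minimal sketch of the two setups being compared (the names below are illustrative, not taken from the actual applications):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("example");
// second application: plain RDD API through JavaSparkContext
JavaSparkContext sc = new JavaSparkContext(conf);
// first application: SQLContext wrapping the same context, used for DataFrames
SQLContext sqlcontext = new SQLContext(sc);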

NOTE: I am not getting any errors in either case.

Here is the piece of code I am using to load the 5 tables:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
DataFrame df = sqlcontext.load("jdbc", options);
//df.show();
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile("hdfs://path");

Map<String, String> options2 = new HashMap<>();
options2.put("driver", "oracle.jdbc.driver.OracleDriver");
options2.put("url", "XXXXXXX");
options2.put("dbtable", "QLRCR2.table2");
DataFrame df2 = sqlcontext.load("jdbc", options2);
//df2.show();
JavaRDD<Row> rdd2 = df2.javaRDD();
rdd2.saveAsTextFile("hdfs://path");

Any help will be appreciated :)

The number of executors when running on YARN is set with --num-executors N. Note that this does NOT mean you will get N executors, only that N will be requested from YARN. The amount you can actually get depends on the amount of resources you request per executor. For example, if each node has 25GB dedicated to YARN (yarn-site.xml, yarn.nodemanager.resource.memory-mb) and you have 8 nodes, and no other application is running on YARN, it makes sense to request 8 executors with ~20GB each. Notice that on top of what you request with --executor-memory, Spark adds an overhead of 10% (the default), so you can't ask for the whole 25GB. It is more or less similar for --executor-cores (yarn-site.xml, yarn.nodemanager.resource.cpu-vcores).
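
As a concrete sketch of that sizing, assuming the 8-node / 25GB-per-node setup described above (the memory and core values here are illustrative, not a definitive recommendation), the submit command for the first application could look like:

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 8 --executor-memory 20G --executor-cores 4 oracleData.jar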

The second question, regarding the number of tasks, is a separate thing; check out this good explanation of how stages are split into tasks.
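
To tie that back to the posted code: a JDBC load that only specifies driver, url, and dbtable is read as a single partition, which would explain the single active task during the save. A minimal sketch of splitting the read into multiple partitions, assuming the table has a numeric column (hypothetically called ID here) with a known value range, would be:

Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
// partition the read; the column name and bounds below are assumptions for illustration
options.put("partitionColumn", "ID");
options.put("lowerBound", "1");
options.put("upperBound", "500000000");
options.put("numPartitions", "40");
DataFrame df = sqlcontext.load("jdbc", options);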
