Spark job not getting enough containers on cluster
I have a Spark application which reads data from Oracle into DataFrames, converts them to JavaRDDs, and saves them as text files to HDFS. I am running this on YARN on an 8-node cluster. When I look at the job on the Spark web UI, I can see it is getting only 2 containers and 2 CPUs.
I am reading 5 tables from Oracle. Each table has around 500 million rows, and the total data size is about 80GB.
spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster oracleData.jar
I also tried:

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 40 oracleDataWrite.jar

With this I could see 40 containers get assigned to the job. However, I could only see 1 active task on the web UI.
I have another Spark application which loads a 20GB text file, does some processing on the data, and saves the result to HDFS. I can see it gets assigned around 64 containers and CPUs.
spark-submit --class "practice.FilterSave" --master yarn --deploy-mode cluster batch-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar mergedData.json
The difference between them is that the second application uses JavaSparkContext, while the first uses SQLContext in order to work with DataFrames.
NOTE: I AM NOT GETTING ANY ERROR FOR EITHER APPLICATION.
Here is the code I am using to load the 5 tables:
Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
DataFrame df = sqlcontext.load("jdbc", options);
//df.show();
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile("hdfs://path");

Map<String, String> options2 = new HashMap<>();
options2.put("driver", "oracle.jdbc.driver.OracleDriver");
options2.put("url", "XXXXXXX");
options2.put("dbtable", "QLRCR2.table2");
DataFrame df2 = sqlcontext.load("jdbc", options2); // was "options", which loaded table1 twice
//df2.show();
JavaRDD<Row> rdd2 = df2.javaRDD();
rdd2.saveAsTextFile("hdfs://path");
Any help will be appreciated :)
The number of executors when running on YARN is set with --num-executors N. Note that this does NOT mean you will get N executors, only that N will be requested from YARN. The number you actually get depends on the amount of resources you request per executor. For example, if each node has 25GB dedicated to YARN (yarn.nodemanager.resource.memory-mb in yarn-site.xml) and you have 8 nodes, and no other application is running on YARN, it makes sense to request 8 executors with ~20GB each. Notice that on top of what you request with --executor-memory, Spark adds an overhead of 10% (the default), so you can't ask for the whole 25GB.
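As a rough illustration of that overhead calculation (a sketch, assuming the 10% default overhead and the hypothetical 25GB-per-node figure from the example above):

```java
// Rough upper bound for --executor-memory, assuming Spark on YARN adds
// the default ~10% memory overhead on top of what you request.
public class ExecutorMemoryBudget {
    public static void main(String[] args) {
        double yarnNodeMemGb = 25.0;    // yarn.nodemanager.resource.memory-mb, in GB (example value)
        double overheadFraction = 0.10; // default overhead fraction (assumption: no custom override)

        // --executor-memory must satisfy: mem * (1 + overhead) <= node memory
        double maxExecutorMemGb = yarnNodeMemGb / (1.0 + overheadFraction);
        System.out.printf("max --executor-memory is roughly %.1f GB%n", maxExecutorMemGb);
    }
}
```

So requesting ~20GB per executor leaves comfortable headroom under the ~22GB ceiling.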
More or less the same applies to --executor-cores (yarn.nodemanager.resource.cpu-vcores in yarn-site.xml).
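Putting that together for the 8-node example above, a submission along these lines would ask YARN for one large executor per node (the memory and core numbers here are illustrative, not taken from the question's actual cluster config):

```
spark-submit --class "oracle.table.join.JoinRdbmsTables" \
  --master yarn --deploy-mode cluster \
  --num-executors 8 \
  --executor-memory 20G \
  --executor-cores 4 \
  oracleData.jar
```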
The second question, regarding the number of tasks, is a separate issue; check out this good explanation of how stages are split into tasks.
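In short, the number of concurrent tasks in a stage equals the number of partitions, and a JDBC load via sqlcontext.load("jdbc", ...) produces a single partition unless you tell Spark how to split the table. A sketch of a partitioned read using the standard JDBC partitioning options, assuming the table has a numeric column suitable for range splitting (the column name "ID" and the bounds below are assumptions; substitute real values for your table):

```java
Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
// Partitioning options: Spark issues one query per partition over ranges
// of partitionColumn, so the read (and the subsequent save) can run as
// many parallel tasks instead of one.
options.put("partitionColumn", "ID");   // assumption: a numeric key column exists
options.put("lowerBound", "1");         // assumption: min value of that column
options.put("upperBound", "500000000"); // assumption: roughly the row count
options.put("numPartitions", "40");

DataFrame df = sqlcontext.load("jdbc", options);
df.javaRDD().saveAsTextFile("hdfs://path");
```

With 40 partitions, the 40 containers you requested would each have work to do, instead of 39 of them sitting idle behind 1 active task.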