Execute multiple Spark jobs

I am running a Spark job with the following cluster and application configuration:

Total nodes: 3
Master node: 7.5 GB memory, 2 cores
Worker node 1: 15 GB memory, 4 cores
Worker node 2: 15 GB memory, 4 cores

Application configuration:

--master yarn --num-executors 2 --executor-cores 2 --executor-memory 2G

I am trying to submit multiple jobs at the same time with the same user; however, I see that only the first two submitted jobs are executing, and the third has to wait with the following warning:

19/11/19 08:30:49 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/11/19 08:30:49 WARN org.apache.spark.util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.

I found that a SparkUI is being created for every submitted job, and my cluster is accepting only two jobs at a time. Further, I observed that it picked up the third job on port 4042 once the first submitted job finished execution. What could be wrong with my cluster that it accepts only two jobs at a time?

Here is my SparkSession code:

val spark: SparkSession = {
  val sparkSession = SparkSession
    .builder()
    .appName("Data Analytics")
    .config("spark.scheduler.mode", "FAIR")
    //.master("local[*]")
    .getOrCreate()
  sparkSession
}

My further questions are: Why does SparkSession create a SparkUI for each submitted job, and how can we solve this problem? Is there any way to use the same session for multiple jobs?
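One way to reuse a single session, sketched under the assumption that the separate submissions can be merged into one driver program (the MultiJobApp name and the input paths below are placeholders, not from the question), is to fire the actions concurrently from one SparkSession so they become parallel jobs inside a single application with a single SparkUI:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object MultiJobApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Data Analytics")
      .config("spark.scheduler.mode", "FAIR") // FAIR mode interleaves the concurrent jobs
      .getOrCreate()

    // Each Future triggers its own action, so each becomes a separate job
    // inside the SAME application (and the same SparkUI on one port).
    val job1 = Future { spark.read.parquet("/data/input1").count() } // placeholder path
    val job2 = Future { spark.read.parquet("/data/input2").count() } // placeholder path

    val counts = Await.result(Future.sequence(Seq(job1, job2)), Duration.Inf)
    println(s"Row counts: ${counts.mkString(", ")}")

    spark.stop()
  }
}

With FAIR scheduling, both jobs share the one application's executors instead of competing for separate YARN containers.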

There are several things that you have to take into account. Once you execute spark-submit, a Spark application is created (in client mode the driver runs on the submitting machine), and a new driver is started whose web UI binds to port 4040. That is the reason for the warning: when you submit another application, another driver is created, but port 4040 is already in use, so it falls back to 4041, then 4042, and so on. Also, a Spark job is not a Spark application; a job is the execution that corresponds to a Spark action, so the number of jobs spawned depends on the number of actions your program executes.
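If the port warnings themselves are the concern, each application can pin its UI port or disable the UI entirely through the standard spark.ui.port / spark.ui.enabled properties; the port value below is only an example:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession
  .builder()
  .appName("Data Analytics")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.ui.port", "4050")       // example: pin this application's UI port
  //.config("spark.ui.enabled", "false") // or disable the UI for batch applications
  .getOrCreate()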

In your case you are asking for two executors with two cores each; in other words, each application tries to create two JVMs with two cores each, in addition to the driver. Because you are using YARN, it will try to provide those 4 executor cores for each of your applications plus one core for each driver, so the cluster's worker cores are consumed quickly.
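As an illustration (not part of the original answer), if more applications need to run side by side on this cluster, one option is to shrink each application's resource request so YARN can schedule more of them concurrently; the exact values depend on your workload, and smallerSession is just a placeholder name:

import org.apache.spark.sql.SparkSession

val smallerSession = SparkSession
  .builder()
  .appName("Data Analytics")
  .config("spark.executor.instances", "2") // equivalent to --num-executors 2
  .config("spark.executor.cores", "1")     // half the cores per executor
  .config("spark.executor.memory", "1g")   // half the memory per executor
  .getOrCreate()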

For more info, check this link: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-scheduler-ActiveJob.html
