
Total number of jobs in a Spark App

I already saw this question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?

I can probably get the number of finished jobs with the listeners but I'm missing the total number of jobs that will be run.
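For context, this is roughly what I mean by counting with listeners (a minimal sketch in Scala; the class and counter names are my own):

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Counts jobs as the driver submits them and as they finish.
// "submitted" only grows when the driver actually reaches an action,
// so it never tells me the final total in advance.
class JobProgressListener extends SparkListener {
  val submitted = new AtomicInteger(0)
  val completed = new AtomicInteger(0)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    submitted.incrementAndGet()
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    completed.incrementAndGet()
  }
}

// Registered on an existing SparkContext `sc` before any actions run:
// val listener = new JobProgressListener()
// sc.addSparkListener(listener)
```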

I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find that total anywhere.

@Edit: I know there's a REST endpoint for getting all the jobs in an app but:

  1. I would prefer not to use REST but to get it in the app itself (Spark running on AWS EMR/Yarn - getting the address is probably doable but I'd prefer not to); see the sketch after this list
  2. that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs.
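For what it's worth, this is the kind of in-app access I had in mind, via SparkContext.statusTracker (a sketch; it has the same limitation as the REST endpoint in that it only knows about jobs that have already been submitted):

```scala
// Query job status from inside the driver, without going through the REST API.
// Assumes an existing SparkContext `sc`.
val tracker = sc.statusTracker

tracker.getActiveJobIds().foreach { jobId =>
  tracker.getJobInfo(jobId).foreach { info =>
    println(s"job $jobId: status=${info.status} stages=${info.stageIds.mkString(",")}")
  }
}
```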

After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark would do such an analysis upfront (jobs are submitted by each action independently, so Spark doesn't have a big picture of all the jobs from the start).

This kind of makes sense because of how Spark divides work into:

  • jobs - which are started whenever the code running on the driver node encounters an action (i.e. collect(), take(), etc.) and are supposed to compute a value and return it to the driver
  • stages - which are composed of sequences of tasks between which no data shuffling is required
  • tasks - computations of the same type which can run in parallel on worker nodes

So we do need to know the stages and tasks upfront for a single job in order to create its DAG, but we don't necessarily need a DAG of jobs; we can just create them "as we go".
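A small sketch of what that means in practice (the input path is just a placeholder, and `sc` is an existing SparkContext): each action submits its own job only when the driver reaches it, and the shuffle inside that job is what creates a stage boundary.

```scala
val lines = sc.textFile("hdfs:///some/input")   // lazy: no job submitted yet
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                           // shuffle => stage boundary inside a job

counts.count()     // action #1: job 0 is submitted here (two stages: map side, reduce side)
counts.collect()   // action #2: job 1 only comes into existence when the driver reaches this line
```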
