How to know which stage of a job is currently running in Apache Spark?
Consider I have a job as follows in Spark:
CSV File ==> Filter By A Column ==> Taking Sample ==> Save As JSON
Now my requirement is: how do I know programmatically which step of the job (fetching the file, filtering, or sampling) is currently executing (preferably using the Java API)? Is there any way to do this?
I can track jobs, stages and tasks using the SparkListener class, e.g. by tracking a stage ID. But how do I know which stage ID corresponds to which step in the job chain?
What I want is to send a notification to the user when, say, Filter By A Column has completed. For that I made a class that extends SparkListener. But I cannot find out where to get the name of the currently executing transformation. Is it possible to track this at all?
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.scheduler.SparkListenerStageSubmitted;
import org.apache.spark.scheduler.SparkListenerTaskStart;

public class ProgressListener extends SparkListener {
    @Override
    public void onJobStart(SparkListenerJobStart jobStart)
    {
    }

    @Override
    public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted)
    {
        // stageSubmitted.stageInfo().name() gives only the call site of the
        // action that triggered the stage, e.g. "count at <console>:25" --
        // not the name of the transformation currently executing
    }

    @Override
    public void onTaskStart(SparkListenerTaskStart taskStart)
    {
        // there is no method like taskStart.name()
    }
}
You cannot know exactly when, e.g., the filter operation starts or finishes.

That's because you have transformations (filter, map, ...) and actions (count, foreach, ...). Spark will put as many operations into one stage as possible. The stage is then executed in parallel on the different partitions of your input. And here comes the problem.
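The transformation/action split can be sketched without Spark at all. Here is a plain-Python analogy (illustrative only, none of this is Spark API): building the pipeline of "transformations" runs nothing, and only the "action" at the end triggers all the work:

```python
events = []  # records when each operator actually runs

def load():
    for x in [1, 2, 3, 4]:
        events.append(("load", x))
        yield x

# "transformations": building the pipeline executes nothing yet
mapped = (x * 10 for x in load())
filtered = (x for x in mapped if x > 10)

assert events == []  # still lazy -- nothing has run

# "action": consuming the pipeline triggers all the work at once
result = sum(1 for _ in filtered)  # analogous to count()

print(result)  # 3 elements (20, 30, 40) pass the filter
```

This is why asking "is the filter running right now?" is ill-posed: the filter has no execution slot of its own.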
Assume you have several workers and the following program:

LOAD ==> MAP ==> FILTER ==> GROUP BY + Aggregation
This program will probably have two stages: the first stage will load the file and apply the map and filter. Then the output will be shuffled to create the groups. In the second stage the aggregation will be performed.
Now, the problem is that you have several workers and each will process a portion of your input data in parallel. That is, every executor in your cluster receives a copy of your program (the current stage) and executes it on its assigned partitions.
You see, you will have multiple instances of your map and filter operators that are executed in parallel, but not necessarily at the same time. In an extreme case, worker 1 will finish with stage 1 before worker 20 has started at all (and therefore finish its filter operation before worker 20).
For RDDs, Spark uses the iterator model inside a stage. For Datasets in recent Spark versions, however, it generates a single loop over the partition that executes the transformations. This means that in this case Spark itself does not really know when a transformation operator finishes within a single task!
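The iterator model can also be illustrated in plain Python (again just an analogy, not Spark code): because the operators are fused into one per-element loop, map and filter calls interleave element by element, and there is no point in time at which "the map is done" for the whole partition:

```python
trace = []  # records each operator invocation in order

def my_map(it):
    for x in it:
        trace.append(("map", x))
        yield x + 1

def my_filter(it):
    for x in it:
        trace.append(("filter", x))
        if x % 2 == 0:
            yield x

partition = [1, 2, 3]
out = list(my_filter(my_map(iter(partition))))

print(out)    # [2, 4]
print(trace)  # map and filter interleave per element:
# [('map', 1), ('filter', 2), ('map', 2), ('filter', 3), ('map', 3), ('filter', 4)]
```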
Long story short:

I ran into the same problem myself:
In our Piglet project (please allow some advertisement ;-) ) we generate Spark code from Pig Latin scripts and wanted to profile the scripts. I ended up inserting mapPartitions operators between all user operators; these send the partition ID and the current time to a server, which evaluates the messages. However, this solution also has its limitations... and I'm not completely satisfied yet.
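The probe idea can be sketched like this (a plain-Python simulation, not Spark code; in real Spark the `probe` wrapper would be inserted as a `mapPartitions` call between the user operators, and `report` would post to a monitoring server instead of appending to a list):

```python
import time

reports = []  # stand-in for messages sent to a monitoring server

def report(label, partition_id):
    reports.append((label, partition_id, time.time()))

def probe(label):
    """Wrap a partition iterator to observe when it starts and ends."""
    def wrapper(partition_id, it):
        report(label + ":start", partition_id)
        for x in it:
            yield x
        report(label + ":end", partition_id)
    return wrapper

# simulate two partitions flowing through a user filter, then the probe
def user_filter(it):
    return (x for x in it if x > 1)

for pid, partition in enumerate([[1, 2], [3, 4]]):
    list(probe("after-filter")(pid, user_filter(iter(partition))))

labels = [(label, pid) for label, pid, _ in reports]
print(labels)
# [('after-filter:start', 0), ('after-filter:end', 0),
#  ('after-filter:start', 1), ('after-filter:end', 1)]
```

The ":end" timestamp tells you the upstream operator has finished for that partition, which is exactly the per-step signal Spark does not expose by itself.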
However, unless you are able to modify the programs, I'm afraid you cannot achieve what you want.
Did you consider this option: http://spark.apache.org/docs/latest/monitoring.html
It seems you can use the following REST API to get a certain job's state: /applications/[app-id]/jobs/[job-id]
You can set the job group ID and description so you can track which job group is being handled, i.e. with setJobGroup.

Assuming you set the job group ID to "1" and the description to "Test job":
sc.setJobGroup("1", "Test job")
When you call http://localhost:4040/api/v1/applications/[app-id]/jobs/[job-id]
you'll get a JSON response with a descriptive name for that job:
{
"jobId" : 3,
"name" : "count at <console>:25",
"description" : "Test Job",
"submissionTime" : "2017-02-22T05:52:03.145GMT",
"completionTime" : "2017-02-22T05:52:13.429GMT",
"stageIds" : [ 3 ],
"jobGroup" : "1",
"status" : "SUCCEEDED",
"numTasks" : 4,
"numActiveTasks" : 0,
"numCompletedTasks" : 4,
"numSkippedTasks" : 0,
"numFailedTasks" : 0,
"numActiveStages" : 0,
"numCompletedStages" : 1,
"numSkippedStages" : 0,
"numFailedStages" : 0
}
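A minimal client for this could look like the following sketch (Python stdlib only; the URL, app ID, and job ID are placeholders you would fill in, and the sample response above is parsed directly here instead of being fetched):

```python
import json

# In a real application you would fetch the JSON from the REST API, e.g.:
#   from urllib.request import urlopen
#   body = urlopen("http://localhost:4040/api/v1/applications/APP_ID/jobs/3").read()
# Here we parse a trimmed copy of the sample response instead:
body = """
{
  "jobId": 3,
  "description": "Test Job",
  "jobGroup": "1",
  "status": "SUCCEEDED",
  "numTasks": 4,
  "numCompletedTasks": 4
}
"""

job = json.loads(body)
progress = job["numCompletedTasks"] / job["numTasks"]
if job["jobGroup"] == "1" and job["status"] == "SUCCEEDED":
    print(f"{job['description']}: {progress:.0%} done")  # Test Job: 100% done
```

Polling this endpoint and matching on the jobGroup you set lets you notify the user when a particular named job has finished.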