
Logging Spark Jobs

I'm trying to keep track of the jobs that are being submitted on a cluster, and so far I have only found logging solutions for event logs using spark.eventLog.enabled = true, which provides information on when tasks start and finish (more info on that here), or log4j, which also provides information on task status and progress.
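For reference, a minimal sketch of how I enable the event log from PySpark (the application name and log directory below are only placeholders):

from pyspark import SparkConf, SparkContext

# Record job/stage/task start and finish events to the event log.
conf = (SparkConf()
        .setAppName("event-log-example")                          # placeholder app name
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "file:///tmp/spark-events"))   # placeholder directory

sc = SparkContext(conf=conf)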

What I really want, though, is to log the code that is run. So this would capture the statements that were executed, like var = sc.range(1000) or min_var = var.min(). From what I have seen, the loggers described above cannot do this.

As an example, if I ran the two commands above (var = sc.range(1000) and min_var = var.min()), I would want to see something like the following in a log4j-like logger:

INFO RUNNING var = sc.range(1000)

INFO RUNNING min_var = var.min()

Has anyone run across a logger like this?

If you are running on YARN and yarn.log-aggregation-enable is set to true, you could do:

yarn logs --applicationId <application-id>

and retrieve logs for any finished application, including the logs that were generated by your code.
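If you also want your own lines (like the INFO RUNNING ... examples above) to show up in those aggregated logs, one option is to write them through the driver's log4j logger yourself. The sketch below uses PySpark's internal JVM gateway (sc._jvm), which is not a public API, and the message text is written by hand rather than captured automatically, so treat it as an illustration only:

from pyspark import SparkContext

sc = SparkContext(appName="logging-example")  # placeholder app name

# Obtain a log4j logger on the driver JVM; sc._jvm is an internal PySpark handle.
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("myJobLogger")  # hypothetical logger name

logger.info("RUNNING var = sc.range(1000)")
var = sc.range(1000)

logger.info("RUNNING min_var = var.min()")
min_var = var.min()

When the driver runs inside YARN (cluster mode), these lines land in the driver container's log and are retrieved by the yarn logs command above.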


UPDATE:

Unfortunately there is no (at least no popular) library that lets you do that (logging the code you run between stage boundaries), but for the rest you can mine the Spark driver's logs to get something more informative than what you have now. First make sure you get the most out of the logs by setting log4j to the DEBUG level (create conf/log4j.properties from the template and edit it) as follows:

log4j.rootCategory=DEBUG, console
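(If you only need the extra verbosity at runtime and don't want to touch the properties file, PySpark's SparkContext also exposes setLogLevel; a quick sketch, meant as a complement to the file-based setup above, with a placeholder application name:)

from pyspark import SparkContext

sc = SparkContext(appName="verbose-logging-example")  # placeholder app name
sc.setLogLevel("DEBUG")  # overrides the configured log level for this context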

Then do some filtering on the logs (especially the driver's logs). For instance, for the job:

user@laptop:~$ cd ~/opt/spark
user@laptop:~/opt/spark$ git clone https://github.com/ehiggs/spark-terasort.git
user@laptop:~/opt/spark$ cd spark-terasort
user@laptop:~/opt/spark/spark-terasort$ mvn package
...
user@laptop:~/opt/spark/spark-terasort$ cd .. 
user@laptop:~/opt/spark$ ./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar ~/data/terasort_in ~/data/terasort_out &> logs.logs

Then you could do:

user@laptop:~/opt/spark$ cat logs.logs | grep "Registering RDD\|Final stage\|Job" | grep DAG

and get something like:

16/03/24 23:47:44 INFO DAGScheduler: Registering RDD 0 (newAPIHadoopFile at TeraSort.scala:60)
16/03/24 23:47:44 INFO DAGScheduler: Final stage: ResultStage 1 (sortByKey at TeraSort.scala:61)
16/03/24 23:48:41 INFO DAGScheduler: Job 0 finished: sortByKey at TeraSort.scala:61, took 56.468248 s
16/03/24 23:48:41 INFO DAGScheduler: Registering RDD 1 (partitionBy at TeraSort.scala:61)
16/03/24 23:48:41 INFO DAGScheduler: Final stage: ResultStage 4 (saveAsNewAPIHadoopFile at TeraSort.scala:62)
16/03/24 23:50:35 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at TeraSort.scala:62, took 114.042019 s

Note that narrow transformations that are not at the tail of a stage will not be listed. Here, rather than adding logging calls to your code, you could enrich your RDDs' names using the following trick:

rdd.setName("more interesting info, or even the algorithm itself")

and get it displayed in the Spark logs themselves as a guide.
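Applied to the code from the question, a sketch could look like this (the name string is arbitrary and only meant to make the corresponding entries easier to recognize):

# Name the RDD so its entries are easier to spot in the logs and UI.
var = sc.range(1000).setName("range(1000) used for min computation")
min_var = var.min()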

Hope this gives you some ideas to get closer to what you expect.
