spark-submit in cluster deploy mode: get application id to console

I am stuck on one problem which I need to resolve quickly. I have gone through many posts and tutorials about Spark cluster deploy mode, but I am clueless about the approach as I have been stuck for some days.

My use-case: I have lots of Spark jobs submitted using the 'spark2-submit' command, and I need to get the application ID printed to the console once they are submitted. The Spark jobs are submitted using cluster deploy mode. (In normal client mode, it gets printed.)
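
For context, the jobs are launched roughly like this; the class name, jar path and arguments below are placeholders, not the actual jobs:

spark2-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  /path/to/myjob.jar <job arguments>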

Points I need to consider while creating a solution: I am not supposed to change application code (that would take a long time, since there are many applications running); I can only provide log4j properties or some custom coding.

My approach:

1) I have tried changing the log4j levels and various log4j parameters, but the logging still goes to the centralized log directory.

Part of my log4j.properties:

log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=ALL,console
log4j.appender.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.Target=System.out

log4j.logger.org.apache.spark.deploy.SparkSubmit=ALL
log4j.appender.org.apache.spark.deploy.SparkSubmit=console

log4j.logger.org.apache.spark.deploy.SparkSubmit=TRACE,console
log4j.additivity.org.apache.spark.deploy.SparkSubmit=false

log4j.logger.org.apache.spark.deploy.yarn.Client=ALL
log4j.appender.org.apache.spark.deploy.yarn.Client=console


log4j.logger.org.apache.spark.SparkContext=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=INFO,console

log4j.logger.org.apache.hadoop.ipc.Client=ALL
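
These properties are shipped to the driver and executors in the usual way, something along the lines of the following (the local file path is a placeholder; --files and the extraJavaOptions settings are standard Spark options):

spark2-submit \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  <rest of the parameters>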

2) I have also tried adding a custom listener, and I am able to get the Spark application ID after the application finishes, but not printed to the console.

Code logic:

// Imports needed by the snippet below; the listener class name is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerApplicationEnd;

public class AppIdListener extends SparkListener
{
    @Override
    public void onApplicationEnd(SparkListenerApplicationEnd arg0)
    {
         for (Thread t : Thread.getAllStackTraces().keySet())
         {
            if (t.getName().equals("main"))
            {
                System.out.println("The current state : " + t.getState());

                Configuration config = new Configuration();

                // getjobUId is assumed to be defined elsewhere and to hold the application id string
                ApplicationId appId = ConverterUtils.toApplicationId(getjobUId);

                // some logic to communicate with the main thread to print the app id to console.
            }
         }
    }
}
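
For reference, a listener like this can be attached without touching the application code via the spark.extraListeners setting; the fully-qualified class name below is illustrative, and the listener jar also has to be on the driver classpath (e.g. via --jars). Note that the listener runs inside the driver, which in cluster deploy mode is not the machine the job was submitted from:

spark2-submit --jars /path/to/listener.jar --conf spark.extraListeners=com.example.AppIdListener <rest of the parameters>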

3) I have enabled event logging (spark.eventLog.enabled set to true) and specified a directory in HDFS to which the event logs from the spark-submit command are written.
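
Concretely, that amounts to the standard event-log settings, roughly as follows; the HDFS path is a placeholder:

--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory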

If anyone could help me find an approach to a solution, it would be really helpful. Or, if I am doing something very wrong, any insights would help me.

Thanks.

After being stuck at the same place for some days, I was finally able to find a solution to my problem.

After going through the Spark code for the cluster deploy mode and some blogs, a few things became clear. It might help someone else looking to achieve the same result.

In cluster deploy mode, the job is submitted via a Client thread from the machine the user is submitting from. I was actually passing the log4j configs to the driver and executors, but missed the fact that the log4j config for the "Client" was missing.

So we need to use:

SPARK_SUBMIT_OPTS="-Dlog4j.debug=true -Dlog4j.configuration=<location>/log4j.properties" spark-submit <rest of the parameters>
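
For illustration, a minimal client-side log4j.properties along these lines should be enough to surface the YARN client's application report (which contains the application id) on the console; the conversion pattern is just an example:

log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark.deploy.yarn.Client=INFO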

To clarify:

  1. client mode means the Spark driver is running on the same machine you ran spark-submit from
  2. cluster mode means the Spark driver is running out on the cluster somewhere

You mentioned that it gets logged when you run the app in client mode and you can see it in the console. Your output is also getting logged when you run in cluster mode; you just can't see it because it is running on a different machine.

Some ideas:

  • Aggregate the logs from the worker nodes into one place where you can parse them to get the app ID.
  • Write the appIDs to some shared location like HDFS or a database. You might be able to use a Log4j appender if you want to keep log4j (a sketch follows below).
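
As a sketch of the appender idea, the log4j.properties used on the submitting machine could additionally route the YARN client logger to a file on a shared mount, so the application ids of all submissions accumulate in one place; the appender name and file path are placeholders:

log4j.logger.org.apache.spark.deploy.yarn.Client=INFO, appIdFile
log4j.appender.appIdFile=org.apache.log4j.FileAppender
log4j.appender.appIdFile.File=/shared/spark/app-ids.log
log4j.appender.appIdFile.layout=org.apache.log4j.PatternLayout
log4j.appender.appIdFile.layout.ConversionPattern=%d %m%n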
