
How can I obtain the DAG of an Apache Spark job without running it?

Original post: 2017-09-16 13:34:52 | Tags: scala, apache-spark

I have some Scala code that I can run with Spark using spark-submit. As I understand it, Spark creates a DAG in order to schedule the operations.

Is there a way to retrieve this DAG without actually performing the heavy operations, e.g. just by analyzing the code?

I would like a usable representation, such as a data structure or at least a textual representation, not the DAG visualization.

1 answer

If you are using DataFrames (Spark SQL), you can call df.explain(true) to print the plan and all operations, both before and after optimization: the parsed, analyzed, and optimized logical plans, followed by the physical plan.
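As a minimal sketch of the DataFrame case, the program below builds a small aggregation and inspects its plans without calling any action. The object name `ExplainSketch`, the helper `build`, the column name `bucket`, and the `local[*]` master are assumptions made for illustration and do not come from the question. `df.queryExecution` is a developer-facing API, so its exact shape may differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExplainSketch {
  // Hypothetical helper: only transformations, so nothing is computed here.
  def build(spark: SparkSession): DataFrame =
    spark.range(0L, 1000000L)
      .withColumn("bucket", col("id") % 10)
      .groupBy("bucket")
      .count()

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect plans.
    val spark = SparkSession.builder()
      .appName("explain-sketch")
      .master("local[*]")
      .getOrCreate()

    val df = build(spark)

    // Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
    df.explain(true)

    // The same plans as tree-shaped objects rather than console output.
    val optimized = df.queryExecution.optimizedPlan // logical plan after optimization
    val physical = df.queryExecution.executedPlan   // physical plan prepared for execution
    println(optimized.treeString)
    println(physical.treeString)

    spark.stop()
  }
}
```

Both plan objects are trees, so they can be traversed programmatically (for example with `foreach` or `collect` on the plan) when a data structure is preferred over a printed string.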

If you are using RDDs, you can use rdd.toDebugString to get a string representation of the lineage, and rdd.dependencies to get the tree itself: it returns the direct parent dependencies, which you can follow recursively.
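For the RDD case, a rough sketch of walking `rdd.dependencies` recursively might look like the following. The object `LineageSketch`, the helper `lineage`, and the word-count pipeline are hypothetical and only serve as an illustration; the output format is an arbitrary choice.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Hypothetical helper: one line per RDD, indented by its depth in the lineage,
  // with each edge labelled as a shuffle or a narrow dependency.
  def lineage(rdd: RDD[_], depth: Int = 0): Seq[String] = {
    val indent = "  " * depth
    val header = s"$indent${rdd.getClass.getSimpleName}[${rdd.id}]"
    val below = rdd.dependencies.flatMap { dep =>
      val kind = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) "shuffle" else "narrow"
      s"$indent  ($kind dependency)" +: lineage(dep.rdd, depth + 1)
    }
    header +: below
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local context is enough to inspect the lineage.
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    // Transformations only: no action is called on this RDD.
    val counts = sc.parallelize(Seq("a b", "b c", "c a"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Built-in textual representation of the lineage.
    println(counts.toDebugString)

    // The same lineage, obtained by following rdd.dependencies.
    lineage(counts).foreach(println)

    sc.stop()
  }
}
```

Shuffle dependencies mark the points where the scheduler would split the job into stages, so this walk gives an approximation of the stage boundaries. One caveat: a few transformations, such as `sortByKey`, sample their input when they are defined, so building the lineage is not always entirely free of computation.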

If you call these without triggering an actual action, you get a representation of what is going to happen without doing the heavy lifting.
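To illustrate that inspecting the lineage does not execute the transformations, the sketch below counts how many elements the `map` function has processed, using an accumulator. The names `LazySketch`, `inspectWithoutAction`, and `touched` are hypothetical, and the local master is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  // Hypothetical helper: returns the lineage string together with the number of
  // elements processed by the map function at the time of inspection.
  def inspectWithoutAction(sc: SparkContext): (String, Long) = {
    val touched = sc.longAccumulator("touched")
    val doubled = sc.parallelize(1 to 100).map { x =>
      touched.add(1) // runs only when a task actually processes the element
      x * 2
    }
    // No action (count, collect, save, ...) has been called on `doubled`.
    (doubled.toDebugString, touched.sum)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-sketch").setMaster("local[*]"))

    val (debugString, processed) = inspectWithoutAction(sc)
    println(debugString)
    println(s"elements processed before any action: $processed")

    sc.stop()
  }
}
```

Only once an action such as `count()` is called on the RDD does Spark submit a job and run the `map` function.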

Related questions

How to cache data in Apache Spark that can be used by other Spark jobs

In a Scala Spark job running in YARN, how can I fail the job so that YARN shows a Failed status

How to know which stage of a job is currently running in Apache Spark?

Apache Spark: how to cancel a job in code and kill running tasks?

No such method running forEach in Scala job on Apache Spark

How can I run a Spark job programmatically

How can I add configuration files to a Spark job running in YARN-CLUSTER mode?

Spark job running without result for long

Scala - Obtain DAG with stages and tasks without execution

Running a single job across multiple workers in Apache Spark


Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address.
