
How to attach a jar to the Spark cluster that is executing the job?

Spark Streaming is really awesome, but when I use it I run into an issue.

Scenario: I use Spark Streaming to consume messages from Kafka. Currently there are two topics and I hard-code them, which is not good for extensibility.

For example, if there is a new topic, I need to define a Scala class for the Parquet schema, then stop the running Spark application and start it again.

What I expect is that, while Spark is still running, I can add a new jar library and notify Spark to load the new class from that jar, so that Spark can consume messages from the new topic and write the related Parquet files to HDFS.

I would appreciate any suggestions about this. I searched for dynamic loading, but the question is how to attach the new jar to the existing running Spark application without stopping it.

Thank you in advance.

Metadata is an ideal solution for your case. You need to maintain a metadata service, which the Spark Streaming application consumes as a reference for its consumers.

Something like this, exposed over a REST API:

{
  "topicName": "...",
  "schema": {},
  "outputPath": "..."
}
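
A minimal sketch of how the streaming application might consume such a service at startup. The endpoint URL and the TopicMetadata case class mirroring the JSON above are assumptions for illustration; KafkaUtils, Subscribe, and json4s are the standard APIs shipped with Spark and spark-streaming-kafka-0-10.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Illustrative mirror of the JSON above; adjust field types to your service.
case class TopicMetadata(topicName: String, schema: String, outputPath: String)

object MetadataDrivenConsumer {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical endpoint returning a JSON array of topic descriptors.
  def fetchTopics(url: String): List[TopicMetadata] =
    parse(scala.io.Source.fromURL(url).mkString).extract[List[TopicMetadata]]

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("metadata-driven"), Seconds(10))
    val metadata = fetchTopics("http://metadata-service/topics") // assumed URL
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "group.id"           -> "metadata-driven")

    // Topics come from the metadata service instead of being hard-coded.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](metadata.map(_.topicName), kafkaParams))

    stream.foreachRDD { rdd =>
      // Route each topic's records to its configured outputPath here.
      ()
    }
    ssc.start()
    ssc.awaitTermination()
  }
}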

And add a trigger from a custom SparkListener implementation.
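
A minimal sketch of such a trigger, assuming a refresh callback that re-queries the metadata service. MetadataRefreshListener and the callback are illustrative names; SparkListener, SparkListenerJobEnd, and SparkContext.addSparkListener are real Spark APIs.

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// Illustrative listener: after every job finishes, invoke a caller-supplied
// callback that can re-query the metadata service and react to new topics.
class MetadataRefreshListener(refresh: () => Unit) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = refresh()
}

object ListenerRegistration {
  def register(sc: SparkContext): Unit =
    sc.addSparkListener(new MetadataRefreshListener(() =>
      println("re-checking metadata service for new topics") // placeholder action
    ))
}

Note that the listener runs on the driver and only observes events; the refresh callback decides how to react, for example by scheduling a restart of the StreamingContext with the updated topic list.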
