
Spark Stream new Job after stream start

I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I also receive any updates (if there are any) from Kafka via the stream.

The issue comes in when I have a new request to stream a new topic. Since there can be only one StreamingContext per JVM, I cannot create a new stream for every new request.

The ways I figured out are:

  1. Once a DStream is created and Spark Streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDirectStream call (for a new topic2) does not return a stream, and further processing stops. Streaming keeps going only for the first request (say topic1).

  2. Stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old topic (topic1) is lost and only the new one is streamed.

Here is the code, have a look:

    JavaStreamingContext javaStreamingContext;  // assumed to be a class field reused across requests
    if (null == javaStreamingContext) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    } else {
        StreamingContextState streamingContextState = javaStreamingContext.getState();
        if (streamingContextState == StreamingContextState.STOPPED) {
            javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
        }
    }

    Collection<String> topics = Arrays.asList(getTopicName(schemaName));
    SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());

    KafkaUtils.createDirectStream(javaStreamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
            .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
            .foreachRDD(impl);

    // start only if the context is not already running
    // (starting an already ACTIVE context would throw an IllegalStateException)
    if (javaStreamingContext.getState() != StreamingContextState.ACTIVE) {
        javaStreamingContext.start();
        javaStreamingContext.awaitTermination();
    }

Don't worry about SparkVoidFunctionImpl, this is a custom class which implements VoidFunction.

The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.

    KafkaUtils.createDirectStream(javaStreamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))

This does not return a DStream; control just terminates without an error, and the further steps are not executed.

I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming has to be done on multiple different topics, and each of them is handled differently.

Please help.

The thing is that the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.

A few options I could think of:

  1. Spark Job Server: Every time you want to subscribe/stream from a different topic, instead of touching the already running job, start a new job. From your API body you can supply the parameters or the topic name. If you want to stop streaming from a specific topic, just stop the respective job. This gives you a lot of flexibility and control over resources. (A sketch of such a per-topic job follows this list.)

  2. [Theoretical] Topic filter: Subscribe to all topics you think you will want; when records are pulled for a duration, filter them based on a list of topics. Manipulate this list of topics through your API to increase or decrease the scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all. (A sketch of the filter follows the per-topic sketch below.)

  3. Another workaround is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it, and stop it when you don't.
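
For option 1, here is a minimal sketch of a per-topic job, assuming each subscription is submitted as its own application (via Spark Job Server or spark-submit), so every job runs in its own JVM with its own StreamingContext. The class name, broker address and Kafka settings are illustrative assumptions, not code from the question:

    import java.util.*;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    // Hypothetical per-topic job: each submission streams exactly one topic,
    // so starting/stopping a topic means starting/stopping a job.
    public class PerTopicStreamingJob {
        public static void main(String[] args) throws InterruptedException {
            String topic = args[0];  // topic name supplied per submission
            SparkConf conf = new SparkConf().setAppName("stream-" + topic);
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");  // assumed broker
            kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            kafkaParams.put("group.id", "stream-" + topic);

            KafkaUtils.createDirectStream(jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Collections.singletonList(topic), kafkaParams))
                    .map(ConsumerRecord::value)
                    .foreachRDD(rdd -> System.out.println(topic + ": " + rdd.count() + " records"));

            jssc.start();
            jssc.awaitTermination();
        }
    }

Because each topic lives in its own job (and its own JVM), the one-StreamingContext-per-JVM limit is never hit, and stopping a topic is just killing its job.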

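For option 2, here is a minimal sketch of the topic-filter idea, assuming a single stream subscribed to a superset of topics and a set of currently enabled topics; the method and variable names (attachFilteredStream, enabledTopics) are hypothetical:

    import java.util.*;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    // Hypothetical filter: subscribe once to every topic that might ever be needed,
    // then keep only the records whose topic is currently enabled.
    public class TopicFilterExample {

        static void attachFilteredStream(JavaStreamingContext jssc,
                                         Collection<String> allPossibleTopics,
                                         Map<String, Object> kafkaParams,
                                         Set<String> enabledTopics) {

            // Ship the currently enabled topics to the executors as a broadcast variable.
            JavaSparkContext jsc = jssc.sparkContext();
            Broadcast<Set<String>> enabled = jsc.broadcast(new HashSet<>(enabledTopics));

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(allPossibleTopics, kafkaParams));

            // Drop records from topics that are not currently enabled.
            stream.filter(record -> enabled.value().contains(record.topic()))
                  .map(ConsumerRecord::value)
                  .foreachRDD(rdd -> rdd.foreach(value -> System.out.println(value)));
        }
    }

As noted above, code and variable values stay static once the job is running, so a plain broadcast value cannot change the topic scope at runtime without extra work (for example re-broadcasting from the driver between batches); this sketch only shows the filtering mechanics.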