
Spark Stream new Job after stream start

I have a situation where I am trying to stream from Kafka using Spark Streaming. The stream is a direct stream. I am able to create a stream and start streaming, and I also receive any updates (if there are any) from Kafka via the stream.

The issue comes in when I have a new request to stream a new topic. Since there can be only one StreamingContext per JVM, I cannot create a new stream for every new request.

The ways I figured out are:

  1. Once a DStream is created and Spark Streaming is already in progress, just attach a new stream to it. This does not seem to work: the createDirectStream call (for a new topic2) does not return a stream, and further processing stops. Streaming keeps going only for the first request (say topic1).

  2. Stop the stream, create the DStream and then start streaming again. I cannot use the same streaming context (it throws an exception that jobs cannot be added after streaming has been stopped), and if I create a new stream for the new topic (topic2), the old topic (topic1) is lost and only the new one is streamed.

Here is the code, have a look:

    JavaStreamingContext javaStreamingContext;  // assumed to be a class field reused across requests
    if (null == javaStreamingContext) {
        javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
    } else {
        StreamingContextState streamingContextState = javaStreamingContext.getState();
        if (streamingContextState == StreamingContextState.STOPPED) {
            javaStreamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(duration));
        }
    }

    Collection<String> topics = Arrays.asList(getTopicName(schemaName));
    SparkVoidFunctionImpl impl = new SparkVoidFunctionImpl(getSparkSession());

    KafkaUtils.createDirectStream(javaStreamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))
            .map((stringStringConsumerRecord) -> stringStringConsumerRecord.value())
            .foreachRDD(impl);

    // start only if the context is not already running
    // (starting an already ACTIVE context would throw an IllegalStateException)
    if (javaStreamingContext.getState() != StreamingContextState.ACTIVE) {
        javaStreamingContext.start();
        javaStreamingContext.awaitTermination();
    }

Don't worry about SparkVoidFunctionImpl, this is a custom class which implements VoidFunction.

The above is approach 1, where I do not stop the existing streaming. When a new request comes into this method, it does not get a new streaming object; it tries to create a DStream. The issue is that the DStream object is never returned.

    KafkaUtils.createDirectStream(javaStreamingContext,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, getKafkaParamMap()))

This does not return a DStream; control just terminates without an error, and the further steps are not executed.

I have tried many things and read multiple articles, but I believe this is a very common production-level issue. Any streaming has to be done on multiple different topics, and each of them is handled differently.

Please help.

The thing is that the Spark master sends out code to the workers, and although the data is streaming, the underlying code and variable values remain static unless the job is restarted.

A few options I could think of:

  1. Spark Job Server: Every time you want to subscribe/stream from a different topic, instead of touching the already running job, start a new job. From your API body you can supply the parameters or the topic name. If you want to stop streaming from a specific topic, just stop the respective job. This gives you a lot of flexibility and control over resources. (A sketch of such a per-topic job follows this list.)

  2. [Theoretical] Topic filter: Subscribe to all topics you think you will want; when records are pulled for a duration, filter them based on a list of topics. Manipulate this list of topics through your API to increase or decrease the scope of topics; it could be a broadcast variable as well. This is just an idea, I have not tried this option at all. (A sketch of the filter follows the per-topic sketch below.)

  3. Another workaround is to relay your Topic-2 data to Topic-1 using a microservice whenever you need it, and stop it when you don't.
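
For option 1, here is a minimal sketch of a per-topic job, assuming each subscription is submitted as its own application (via Spark Job Server or spark-submit), so every job runs in its own JVM with its own StreamingContext. The class name, broker address and Kafka settings are illustrative assumptions, not code from the question:

    import java.util.*;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    // Hypothetical per-topic job: each submission streams exactly one topic,
    // so starting/stopping a topic means starting/stopping a job.
    public class PerTopicStreamingJob {
        public static void main(String[] args) throws InterruptedException {
            String topic = args[0];  // topic name supplied per submission
            SparkConf conf = new SparkConf().setAppName("stream-" + topic);
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");  // assumed broker
            kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            kafkaParams.put("group.id", "stream-" + topic);

            KafkaUtils.createDirectStream(jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Collections.singletonList(topic), kafkaParams))
                    .map(ConsumerRecord::value)
                    .foreachRDD(rdd -> System.out.println(topic + ": " + rdd.count() + " records"));

            jssc.start();
            jssc.awaitTermination();
        }
    }

Because each topic lives in its own job (and its own JVM), the one-StreamingContext-per-JVM limit is never hit, and stopping a topic is just killing its job.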

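For option 2, here is a minimal sketch of the topic-filter idea, assuming a single stream subscribed to a superset of topics and a set of currently enabled topics; the method and variable names (attachFilteredStream, enabledTopics) are hypothetical:

    import java.util.*;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    // Hypothetical filter: subscribe once to every topic that might ever be needed,
    // then keep only the records whose topic is currently enabled.
    public class TopicFilterExample {

        static void attachFilteredStream(JavaStreamingContext jssc,
                                         Collection<String> allPossibleTopics,
                                         Map<String, Object> kafkaParams,
                                         Set<String> enabledTopics) {

            // Ship the currently enabled topics to the executors as a broadcast variable.
            JavaSparkContext jsc = jssc.sparkContext();
            Broadcast<Set<String>> enabled = jsc.broadcast(new HashSet<>(enabledTopics));

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(allPossibleTopics, kafkaParams));

            // Drop records from topics that are not currently enabled.
            stream.filter(record -> enabled.value().contains(record.topic()))
                  .map(ConsumerRecord::value)
                  .foreachRDD(rdd -> rdd.foreach(value -> System.out.println(value)));
        }
    }

As noted above, code and variable values stay static once the job is running, so a plain broadcast value cannot change the topic scope at runtime without extra work (for example re-broadcasting from the driver between batches); this sketch only shows the filtering mechanics.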