
Spark streaming: How to write cumulative output?

I have to write a single output file for my streaming job.

Question: when will my job actually stop? I killed the server, but that did not work. I want to stop my job from the command line (if that is possible).

Code:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MAYUR_BELDAR_PROGRAM5_V1 {

      def main(args: Array[String]) {

        val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Read lines from a local socket; the port is passed as the first argument
        val lines = ssc.socketTextStream("localhost", args(0).toInt)
        val words = lines.flatMap(_.split(" "))

        // Split words into four classes by the parity of the first character and of the word length
        val class1 = words.filter(a => a.charAt(0).toInt % 2 == 0).filter(a => a.length % 2 == 0)
        val class2 = words.filter(a => a.charAt(0).toInt % 2 == 0).filter(a => a.length % 2 == 1)
        val class3 = words.filter(a => a.charAt(0).toInt % 2 == 1).filter(a => a.length % 2 == 0)
        val class4 = words.filter(a => a.charAt(0).toInt % 2 == 1).filter(a => a.length % 2 == 1)

        // Each batch is written to its own timestamped directory under these prefixes
        class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1", "txt")
        class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
        class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3", "txt")
        class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4", "txt")

        ssc.start()             // Start the computation
        ssc.awaitTermination()  // Block until the context is stopped
        ssc.stop()
      }
    }

A stream by definition does not have an end, so it will not stop unless you call the method that stops it. In my case I have a business condition that tells me when the process is finished, so when I reach that point I call JavaStreamingContext.close(). I also have a monitor that checks whether the process has received any data in the past few minutes; if it has not, it also closes the stream.
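
This is not from the answer above, but a rough sketch of how such an external stop can look with the Scala API, reusing the ssc from the question's code. The marker-file path and the poll interval are made-up examples of an external "business condition":

    import java.nio.file.{Files, Paths}

    // Hypothetical shutdown trigger: a background thread polls for a marker file
    // (path and interval are illustrative only) and then stops the context gracefully
    val stopMarker = Paths.get("/tmp/stop-streaming-job")

    val monitor = new Thread(new Runnable {
      def run(): Unit = {
        while (!Files.exists(stopMarker)) {
          Thread.sleep(10000) // poll every 10 seconds
        }
        // Let the current batch finish, then stop the streaming context and the SparkContext
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    })
    monitor.start()

    ssc.start()
    ssc.awaitTermination() // returns once the monitor thread has stopped the context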

In order to accumulate data you have to use the method updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
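
As a minimal sketch of what that looks like for the word stream in the question, keeping a running count per word (the checkpoint directory is an arbitrary example path):

    // Checkpointing must be enabled before updateStateByKey can be used
    ssc.checkpoint("hdfs://hadoop1:9000/mbeldar/checkpoints") // example path

    // Pair each word with a count of 1, giving a PairDStream of (String, Int)
    val pairs = words.map(word => (word, 1))

    // Merge the counts from the current batch into the state carried over from earlier batches
    val updateCounts = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val cumulativeCounts = pairs.updateStateByKey[Int](updateCounts)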

I have checked the Spark code and found that saveAsTextFiles uses foreachRDD, so in the end each RDD is saved separately and previous RDDs are not taken into account. With updateStateByKey it will still write multiple files, but each file reflects all the RDDs that were processed before it.
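
Continuing the sketch above: writing out the state DStream still produces one timestamped directory per batch interval, but each one contains the totals over everything processed so far (the path prefix is again illustrative).

    // One new directory per batch, but its contents are cumulative
    cumulativeCounts.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/cumulative", "txt")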
