How to set the checkpoint interval for Spark Streaming checkpointing?

I want to set the checkpoint interval for my Python Spark Streaming script, following the official documentation:

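For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.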
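As a rough illustration of the arithmetic in that passage, here is a minimal sketch. The helper names are hypothetical, and reading "a multiple of the batch interval that is at least 10 seconds" as the smallest such multiple is an assumption, not something the quoted text states.

```python
import math


def default_checkpoint_interval(slide_seconds, floor_seconds=10.0):
    """Smallest multiple of the slide interval that is >= floor_seconds.

    Assumption: "a multiple ... at least 10 seconds" means the smallest one.
    """
    return slide_seconds * math.ceil(floor_seconds / slide_seconds)


def suggested_checkpoint_range(slide_seconds):
    """The 5x to 10x range that the documentation suggests trying."""
    return (5 * slide_seconds, 10 * slide_seconds)


batch_seconds = 6  # the batch interval used in the script in this question
print(default_checkpoint_interval(batch_seconds))
print(suggested_checkpoint_range(batch_seconds))
```

Under these assumptions, a 6-second batch interval gives a default of 12 seconds and a suggested range of 30 to 60 seconds, so the value 60 passed to checkpoint() in the script sits at the upper end of that range.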

My script:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def functionToCreateContext():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 6)
    ssc.checkpoint("./checkpoint")
    kvs = KafkaUtils.createDirectStream(ssc, ['test123'], {"metadata.broker.list": "localhost:9092"})

    kvs = kvs.checkpoint(60) #set the checkpoint interval

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    return ssc

if __name__ == "__main__":
    ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext)

    ssc.start()
    ssc.awaitTermination()

Output after running the script:

16/05/25 17:49:03 INFO DirectKafkaInputDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Remember duration = 120000 ms
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1be80174
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, true, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = 60000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 120000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@69f9a089
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@d97386a
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@16c474ad
16/05/25 17:49:03 INFO ForEachDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO ForEachDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO ForEachDStream: Remember duration = 6000 ms
..........

The DStream checkpoint interval is still null. Any ideas?
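To see at a glance which DStreams in a log like the one above report an interval, the "Checkpoint interval" lines can be pulled out with a short script. This is only a sketch: the function name and regular expression are hypothetical, and the sample lines are copied from the output above.

```python
import re

# Matches lines such as
# "... INFO PythonTransformedDStream: Checkpoint interval = 60000 ms"
PATTERN = re.compile(r"INFO (\w+): Checkpoint interval = (.+)$")


def checkpoint_intervals(log_text):
    """Return (dstream_class, interval) pairs in the order they are logged."""
    pairs = []
    for line in log_text.splitlines():
        match = PATTERN.search(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


sample = """\
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = 60000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = null
"""

for name, interval in checkpoint_intervals(sample):
    print(name, interval)
```

On this excerpt, one PythonTransformedDStream reports 60000 ms while the other entries report null. The script only lists what the log says; it does not explain why the remaining DStreams show null.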

Try moving this line a few lines down, after the stream has been created: ssc.checkpoint("./checkpoint")

Basically, set up checkpointing only after the stream has been fully prepared.
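Read literally, that suggestion would change the context-creation function from the question roughly as follows. This is a sketch, not a verified fix: it only reorders the existing calls, and where exactly "fully prepared" ends is an assumption (here, after counts.pprint()). Whether it changes the logged intervals has not been confirmed here.

```python
def functionToCreateContext():
    # PySpark imports are kept inside the function so this sketch can be
    # loaded where PySpark is not installed. pyspark.streaming.kafka is
    # only available in Spark releases before 3.0.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 6)  # 6-second batch interval, as in the question

    kvs = KafkaUtils.createDirectStream(
        ssc, ['test123'], {"metadata.broker.list": "localhost:9092"})
    kvs = kvs.checkpoint(60)  # per-DStream checkpoint interval, in seconds

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    # Moved down from just after the StreamingContext was created.
    ssc.checkpoint("./checkpoint")
    return ssc
```

The function would still be passed to StreamingContext.getOrCreate("./checkpoint", functionToCreateContext) exactly as in the original script. If the ./checkpoint directory already holds data from an earlier run, getOrCreate restores the context from it and does not call the function, so the directory may need to be cleared before the reordered version takes effect.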
