How to set the checkpoint interval for Spark Streaming checkpointing?
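I want to set the checkpoint interval for my Python Spark Streaming script, following the official documentation:
"For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try."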
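To make the numbers in that passage concrete for the 6-second batch interval used in the script, here is a minimal arithmetic sketch. The helper names are hypothetical, and it assumes that "a multiple of the batch interval that is at least 10 seconds" means the smallest such multiple.

```python
import math


def default_checkpoint_interval(slide_seconds, minimum_seconds=10):
    """Smallest multiple of the slide interval that is >= minimum_seconds.

    Assumption: this is how the documented default is chosen.
    """
    return slide_seconds * math.ceil(minimum_seconds / slide_seconds)


def suggested_range(slide_seconds):
    """The 5x to 10x sliding-interval range the documentation suggests trying."""
    return (5 * slide_seconds, 10 * slide_seconds)


batch_seconds = 6  # batch interval used in the script
print(default_checkpoint_interval(batch_seconds))
print(suggested_range(batch_seconds))
```

Under these assumptions, the 60 seconds passed to checkpoint() in the script is a multiple of the batch interval and sits at the upper end of the suggested range.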
My script:
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


def functionToCreateContext():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 6)
    ssc.checkpoint("./checkpoint")
    kvs = KafkaUtils.createDirectStream(ssc, ['test123'], {"metadata.broker.list": "localhost:9092"})
    kvs = kvs.checkpoint(60)  # set the checkpoint interval
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    return ssc


if __name__ == "__main__":
    ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext)
    ssc.start()
    ssc.awaitTermination()
Output after running the script:
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Remember duration = 120000 ms
16/05/25 17:49:03 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1be80174
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, true, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = 60000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 120000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@69f9a089
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@d97386a
16/05/25 17:49:03 INFO PythonTransformedDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO PythonTransformedDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO PythonTransformedDStream: Remember duration = 6000 ms
16/05/25 17:49:03 INFO PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@16c474ad
16/05/25 17:49:03 INFO ForEachDStream: Slide time = 6000 ms
16/05/25 17:49:03 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/05/25 17:49:03 INFO ForEachDStream: Checkpoint interval = null
16/05/25 17:49:03 INFO ForEachDStream: Remember duration = 6000 ms
..........
The DStream checkpoint interval is still null. Any ideas?
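When reading output like this, it can help to pull out only the "Checkpoint interval" lines per DStream. The sketch below does that for a few lines copied from the log above (timestamps trimmed); the function name and regular expression are illustrative and not part of Spark.

```python
import re

# Lines copied from the log output above, with timestamps trimmed.
LOG = """\
INFO DirectKafkaInputDStream: Checkpoint interval = null
INFO PythonTransformedDStream: Checkpoint interval = 60000 ms
INFO PythonTransformedDStream: Checkpoint interval = null
INFO PythonTransformedDStream: Checkpoint interval = null
INFO ForEachDStream: Checkpoint interval = null
"""

PATTERN = re.compile(r"INFO (\w+): Checkpoint interval = (?:null|(\d+) ms)")


def checkpoint_intervals(log_text):
    """Return (dstream_class, interval_ms or None) for each matching line, in log order."""
    result = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            millis = match.group(1 + 1)
            result.append((match.group(1), int(millis) if millis else None))
    return result


for name, interval in checkpoint_intervals(LOG):
    print(name, interval)
```

In this excerpt, one PythonTransformedDStream reports 60000 ms while the other streams, including DirectKafkaInputDStream, report null.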
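Try moving this line a few lines down, after the stream has been created: ssc.checkpoint("./checkpoint")
Basically, set up checkpointing only after the stream is fully prepared.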
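One way to read this suggestion is the reordering sketched below. It is untested and is not confirmed to change the interval that gets logged. It assumes a Spark version that still ships pyspark.streaming.kafka (the module used in the question), and it defers the imports into the function body so that the definition itself does not require pyspark.

```python
def functionToCreateContext():
    # Imports are deferred so that defining this function does not need pyspark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # assumption: module is available

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 6)  # 6-second batch interval, as in the question

    # Build the whole stream graph first.
    kvs = KafkaUtils.createDirectStream(
        ssc, ['test123'], {"metadata.broker.list": "localhost:9092"})
    kvs = kvs.checkpoint(60)  # per-DStream checkpoint interval, in seconds
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    # Moved down, as the answer suggests: set the checkpoint directory last.
    ssc.checkpoint("./checkpoint")
    return ssc
```

The function is passed to StreamingContext.getOrCreate("./checkpoint", functionToCreateContext) exactly as in the original script.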