
Spark Scala - Rolling window with merge

Context: I am building a Spark Streaming job to collect logs. I apply a 10-minute window to the stream. After each window, I need to merge the current window with the 2 previous windows (i.e., 3 windows of 10 minutes = 30 minutes of data) and stream that out. So data streams in as 10-minute batches, and every 10 minutes a 30-minute batch of data streams out.

The only way I could think of is writing the data to storage (a Parquet file here), then reading back the data for the last 30 minutes. How can I keep everything in memory and avoid the disk read/write latency? I still need to persist the windows each time so I can recover the state after a failure.

// Incoming data
// needs: import org.apache.spark.sql.functions.{col, to_utc_timestamp}
//        import org.apache.spark.sql.types.TimestampType
//        import org.apache.spark.sql.SaveMode
stream.window(Minutes(10))
      .foreachRDD(rdd => {
  // withColumn returns a new DataFrame, so keep the result instead of discarding it
  val df = spark.createDataFrame(rdd)
    .withColumn("time", to_utc_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))
  df.write.mode(SaveMode.Append).parquet(parquetFile)
  df.show()
})


// Read back the last 3 windows (30 minutes of data)
// keep the filtered result rather than discarding it before show()
val outstream = spark.read.parquet(parquetFile)
  .select("time", "line")
  .where("time >= current_timestamp() - INTERVAL 30 MINUTES")
outstream.show()

Thanks

I believe you need to specify a sliding interval. In your case that would be a window size of 30 minutes with a sliding interval of 10 minutes. Then you don't need to store the data at all; you can process the 30-minute window directly.
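A minimal sketch of what that could look like, reusing the `stream`, `spark`, and schema from your question: `window(windowDuration, slideDuration)` on a DStream emits, every 10 minutes, an RDD covering the last 30 minutes, so no Parquet round-trip is needed for the merge itself.

// Sliding window: 30-minute window, evaluated every 10 minutes
// needs: import org.apache.spark.streaming.Minutes
//        import org.apache.spark.sql.functions.{col, to_utc_timestamp}
//        import org.apache.spark.sql.types.TimestampType
stream.window(Minutes(30), Minutes(10))
      .foreachRDD(rdd => {
  // Every 10 minutes this RDD already contains the last 30 minutes of records
  val df = spark.createDataFrame(rdd)
    .withColumn("time", to_utc_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))
  // Stream the 30-minute batch out directly, entirely in memory
  df.show()
})

For recovery, windowed DStreams work with Spark Streaming's own checkpointing (`ssc.checkpoint(checkpointDir)` on the StreamingContext), which can replace the hand-written Parquet persistence; the directory name here is just a placeholder.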

Please take a look at this answer.
