
Spark Scala - Rolling window with merge

Context: I am building a Spark Streaming job to collect logs. I apply a 10-minute window to the stream. After each window, I need to merge the current window with the 2 previous windows (i.e., 3 windows of 10 minutes = 30 minutes of data) and stream that out. So data streams in as 10-minute batches, and every 10 minutes a 30-minute batch of data streams out.

The only way I could think of is writing the data to storage (a Parquet file here), then reading back the data for the last 30 minutes. How can I keep everything in memory and avoid the disk read/write latency? I still need to persist the windows each time so I can recover the state after a failure.

// Incoming data
// needs: import org.apache.spark.sql.functions.{col, to_utc_timestamp}
//        import org.apache.spark.sql.types.TimestampType
//        import org.apache.spark.sql.SaveMode
stream.window(Minutes(10))
      .foreachRDD(rdd => {
  // withColumn returns a new DataFrame, so keep the result instead of discarding it
  val df = spark.createDataFrame(rdd)
    .withColumn("time", to_utc_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))
  df.write.mode(SaveMode.Append).parquet(parquetFile)
  df.show()
})


// Read back the last 3 windows (30 minutes of data)
// keep the filtered result rather than discarding it before show()
val outstream = spark.read.parquet(parquetFile)
  .select("time", "line")
  .where("time >= current_timestamp() - INTERVAL 30 MINUTES")
outstream.show()

Thanks

I believe you need to specify a sliding interval. In your case that would be a window size of 30 minutes with a sliding interval of 10 minutes. Then you don't need to store the data at all; you can process the 30-minute window directly.
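A minimal sketch of what that could look like, reusing the `stream`, `spark`, and schema from your question: `window(windowDuration, slideDuration)` on a DStream emits, every 10 minutes, an RDD covering the last 30 minutes, so no Parquet round-trip is needed for the merge itself.

// Sliding window: 30-minute window, evaluated every 10 minutes
// needs: import org.apache.spark.streaming.Minutes
//        import org.apache.spark.sql.functions.{col, to_utc_timestamp}
//        import org.apache.spark.sql.types.TimestampType
stream.window(Minutes(30), Minutes(10))
      .foreachRDD(rdd => {
  // Every 10 minutes this RDD already contains the last 30 minutes of records
  val df = spark.createDataFrame(rdd)
    .withColumn("time", to_utc_timestamp(col("time"), "yyyy-MM-dd'T'HH:mm:ss").cast(TimestampType))
  // Stream the 30-minute batch out directly, entirely in memory
  df.show()
})

For recovery, windowed DStreams work with Spark Streaming's own checkpointing (`ssc.checkpoint(checkpointDir)` on the StreamingContext), which can replace the hand-written Parquet persistence; the directory name here is just a placeholder.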

Please take a look at this answer.
