
Reading fileStream with Spark Streaming

I have a directory on HDFS where a file is copied every 10 minutes (the existing one is overwritten). I'd like to read the content of that file with Spark Streaming (1.6.0) and use it as reference data to join with another stream.

I set the "remember window" spark.streaming.fileStream.minRememberDuration to "600s" and set newFilesOnly to false, because when I start the application I want to fetch the initial data that is already on HDFS.
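
For reference, a minimal sketch of how that setting could be applied on the SparkConf used below (the application name is illustrative):

import org.apache.spark.SparkConf

// Sketch: set the "remember window" on the conf before creating the
// StreamingContext; the app name is a placeholder.
val sparkConf = new SparkConf()
  .setAppName("ReferenceDataJoin")
  .set("spark.streaming.fileStream.minRememberDuration", "600s")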

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
val ssc = new StreamingContext(sparkConf, Seconds(2))
def defaultFilter(path: Path): Boolean = !path.getName().startsWith(".")
val lines: DStream[String] =
   ssc.fileStream[LongWritable, Text, TextInputFormat](loc, defaultFilter(_), false).map(_._2.toString)
lines.foreachRDD { rdd => rdd.foreach(println) }

My idea is to persist the content of this DStream in memory and delegate the task of maintaining this "batch lookup cache" to Spark. I expect fresh data to appear automatically after each change in the HDFS directory, which I can then join with the other stream.

What I don't understand:

  • when I start the application the data is loaded, but if I then touch the file locally and overwrite the one on HDFS, I no longer see its content printed out
  • how do I cache and reload this data?
  • when I cache it, will it be available on the worker nodes, or will this (along with the join) happen in the driver?

Should I also set the StreamingContext time interval to 10 minutes as I will only have changes every 10 minutes?

Just a few rough ideas.

when I start the application the data is loaded, but if I then touch the file locally and overwrite the one on HDFS, I no longer see its content printed out

For Spark Streaming to pick up the data, the files have to be created atomically, e.g. by moving the file into the directory Spark is monitoring; the file rename operation is typically atomic. Could you please test this to verify that it works?
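
A minimal sketch of such an atomic publish (the paths are illustrative; the dot-prefixed temp name is skipped by the defaultFilter from the question while the file is still being written):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Hidden temp name: the dot prefix keeps it out of the fileStream.
val tmp  = new Path("/data/reference/.reference.csv.tmp")
val live = new Path("/data/reference/reference.csv")
// ... write tmp completely here ...
if (fs.exists(live)) fs.delete(live, false) // rename does not overwrite
fs.rename(tmp, live)                        // rename is atomic on HDFS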

how do I cache and reload this data? When I cache it, will it be available on the worker nodes, or will this (along with the join) happen in the driver?

A straightforward solution might be to register a temp table inside the foreachRDD() method. Whenever new data arrives during streaming, the table can simply be recreated. Keep in mind that the logic inside the foreachRDD() method should be idempotent.
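
A minimal sketch of that idea against the Spark 1.6 APIs (the table and column names are assumptions):

import org.apache.spark.sql.SQLContext

lines.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // Idempotent: registering under the same name replaces the old table.
  rdd.toDF("value").cache().registerTempTable("reference")
}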

Knowing the table name, you can easily create a separate query pipeline that joins data from this pre-cached temp table. Just make sure that you configure the StreamingContext to remember a sufficient amount of streaming data so that the query can run.
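
For illustration, a query pipeline on another stream might look like this (otherStream, the "events" table, and the join columns are all hypothetical):

otherStream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF("key").registerTempTable("events")
  // Join the current batch against the pre-cached reference table.
  sqlContext.sql(
    "SELECT e.key, r.value FROM events e JOIN reference r ON e.key = r.value"
  ).show()
}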

Should I also set the StreamingContext time interval to 10 minutes as I will only have changes every 10 minutes?

In the ideal case the cadences should match. Just to be safe, you can also check the timestamp of newly received data in the foreachRDD() method.
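
For example, foreachRDD has an overload that also exposes the batch time, which can be used for such a check (a sketch):

lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // Log the batch time to verify the expected 10-minute cadence.
    println(s"Batch at ${time.milliseconds} ms: ${rdd.count()} records")
  }
}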
