
Read new S3 file paths from Spark Streaming

I'd like to use Spark Streaming to monitor an S3 directory and return the path of any new file added to that directory. Neither textFileStream nor fileStream seems to be able to do this. Is there actually a way to accomplish this?

Edit: Spark ver. 2.1.0

Spark Streaming does this with s3a:// paths; I have tests to prove it.

  1. Set up a window big enough to cope with delays in scanning the directory, and clean the directory up.
  2. You can write straight into the destination "directory"; there is no need to write elsewhere and then rename. If you do write and then rename, the files are copied and are still picked up within the window.
  3. Don't try checkpointing there.
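
To illustrate why the window size in point 1 matters, here is a minimal sketch in plain Python of window-based new-file detection. It is a toy model under stated assumptions, not Spark's actual implementation: the `NewFileTracker` class, its `poll` method, and the `s3a://bucket/in/...` paths are all hypothetical names made up for this example, and the directory listing is passed in as a plain dict of path to modification time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NewFileTracker:
    """Toy model (hypothetical, not a Spark API) of window-based file detection."""

    window_seconds: float
    seen: Set[str] = field(default_factory=set)

    def poll(self, listing: Dict[str, float], now: float) -> List[str]:
        """Return paths not reported before whose timestamp falls inside the window.

        `listing` maps each path to its modification time in seconds.
        Files older than `now - window_seconds` are ignored, so a file that
        shows up in a listing too late is never reported.
        """
        cutoff = now - self.window_seconds
        fresh = sorted(
            path
            for path, mtime in listing.items()
            if mtime >= cutoff and path not in self.seen
        )
        self.seen.update(fresh)
        return fresh


# A 60-second window: the file is listed 10 seconds after it was written.
tracker = NewFileTracker(window_seconds=60)
first = tracker.poll({"s3a://bucket/in/a.csv": 100.0}, now=110.0)
second = tracker.poll({"s3a://bucket/in/a.csv": 100.0}, now=120.0)

# Same window, but the scan is slow: the file is first listed 100 seconds late.
late = NewFileTracker(window_seconds=60).poll(
    {"s3a://bucket/in/b.csv": 100.0}, now=200.0
)

# A wider window tolerates the same delay.
wide = NewFileTracker(window_seconds=120).poll(
    {"s3a://bucket/in/b.csv": 100.0}, now=200.0
)
```

In this model `first` contains the new path, `second` is empty because the path was already reported, `late` is empty because the file fell outside the window before it was listed, and `wide` contains the path. Fewer files in the directory also means a shorter listing time, which is one reading of the advice to keep it cleaned up.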
