
Read, transform and stream to Hadoop

I need to build a server that reads large CSV data files (100s of GB) from a directory, transforms some fields, and streams them to a Hadoop cluster.

These files are copied over from other servers at random times (hundreds of times per day), and copying a single file takes a long time to finish.

I need to:

  1. Regularly check for new files to process (i.e., encrypt and stream)
  2. Check whether a CSV has been completely copied over before kicking off encryption (see the sketch after this list)
  3. Process/stream multiple files in parallel, but prevent two processes from streaming the same file
  4. Mark files that have been streamed successfully
  5. Mark files whose streaming failed and restart the streaming process for them
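For point 2, one common heuristic (not tied to any particular tool) is to treat a file as fully copied once its size stops changing between two polls. A minimal Scala sketch of that idea; the directory path and the `.csv` filter are placeholder assumptions:

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object CopyCompleteCheck {
  // Placeholder: directory the upstream servers copy into.
  val incoming = Paths.get("/data/incoming")

  // Returns files whose size is unchanged since the previous poll,
  // plus the current size snapshot to pass into the next poll.
  def stableFiles(lastSizes: Map[Path, Long]): (Seq[Path], Map[Path, Long]) = {
    val current = Files.list(incoming).iterator().asScala
      .filter(_.toString.endsWith(".csv"))
      .map(p => p -> Files.size(p))
      .toMap
    val stable = current.collect {
      case (p, size) if lastSizes.get(p).contains(size) => p
    }.toSeq
    (stable, current)
  }
}
```

A stable file could then be claimed atomically (e.g., renamed or moved into a staging directory) so that two workers never stream the same file, which also covers point 3.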

My question is: is there an open-source ETL tool that provides all five of these and works well with Hadoop/Spark Streaming? I assume this process is fairly standard, but I haven't been able to find one yet.

Thank you.

Flume or Kafka will serve your purpose. Both are well integrated with Spark and Hadoop.
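If you go the Kafka route, the ingest side publishes CSV lines to a topic and a Spark Streaming job consumes, transforms, and writes them to HDFS. A minimal sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic name (`csv-lines`), output path, and the masking transform are all placeholder assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CsvKafkaToHdfs {
  // Placeholder transform: mask the third column of each CSV line.
  def transform(line: String): String = {
    val cols = line.split(",", -1)
    if (cols.length > 2) cols(2) = "***"
    cols.mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("csv-ingest"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "csv-ingest",
      "auto.offset.reset" -> "earliest"
    )

    // One CSV line per Kafka record, published by whatever tails the files.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("csv-lines"), kafkaParams))

    stream.map(r => transform(r.value()))
      .saveAsTextFiles("hdfs:///data/ingest/csv")     // placeholder HDFS path

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On the Flume side, a spooling-directory source plus an HDFS sink covers much of this out of the box (it renames fully ingested files, which addresses point 4), though it expects files to be complete and immutable before they land in the spool directory.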

Try taking a look at the great library https://github.com/twitter/scalding. Maybe it can point you in the right direction :)
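For reference, a Scalding job that reads a CSV, masks one field, and writes the result back out looks roughly like this; the field names and the masking logic are made up for illustration:

```scala
import com.twitter.scalding._

// Hypothetical schema: the real CSV columns are unknown here.
class MaskCsvJob(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('id, 'name, 'ssn))
    // Mapping a field onto itself replaces its value in place.
    .map('ssn -> 'ssn) { s: String => s.replaceAll("[0-9]", "*") }
    .write(Csv(args("output")))
}
```

Note that Scalding runs as batch MapReduce jobs, so it handles the transform step but not the file-watching and bookkeeping in points 1-5.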
