I need to build a server that reads large CSV data files (hundreds of GB) in a directory, transforms some fields, and streams them to a Hadoop cluster.
These files are copied over from other servers at random times (hundreds of times per day), and it takes a long time for each copy to finish.
I need to:
My question is: is there an open source ETL tool that provides all five of these and works well with Hadoop/Spark Streaming? I assume this process is fairly standard, but I haven't found one yet.
Thank you.
Try taking a look at the excellent library https://github.com/twitter/scalding. Maybe it can point you in the right direction :)
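For reference, a minimal sketch of what the transform step could look like as a Scalding job, using the fields-based API. The field names (`'id`, `'name`, `'value`), the paths, and the trim/uppercase transform are hypothetical placeholders, not anything from your actual schema:

```scala
import com.twitter.scalding._

// Hypothetical job: read a CSV with three columns, normalize the
// 'name field, and write the result back out as CSV.
class TransformCsvJob(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('id, 'name, 'value))
    .mapTo(('id, 'name, 'value) -> ('id, 'name, 'value)) {
      row: (String, String, String) =>
        val (id, name, value) = row
        // Example transform: trim whitespace and uppercase the name.
        (id, name.trim.toUpperCase, value)
    }
    .write(Csv(args("output")))
}
```

You would run it on the cluster with something like `hadoop jar your-assembly.jar com.twitter.scalding.Tool TransformCsvJob --hdfs --input /data/in --output /data/out`. One caveat: Scalding is a batch tool (it compiles to Cascading/MapReduce jobs), so it covers the read/transform/write part well, but the file-watching and streaming pieces of your pipeline would need to be handled by something else.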