
Read, transform and stream to Hadoop

I need to build a server that reads large CSV data files (100s of GB) from a directory, transforms some fields, and streams them to a Hadoop cluster.

These files are copied over from other servers at random times (hundreds of times per day), and copying a single file takes a long time to finish.

I need to:

  1. Regularly check for new files to process (i.e., encrypt and stream)
  2. Check whether a CSV has been completely copied over before kicking off encryption (see the sketch after this list)
  3. Process/stream multiple files in parallel, but prevent two processes from streaming the same file
  4. Mark files that have been streamed successfully
  5. Mark files whose streaming failed and restart the streaming process for them
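For point 2, one common heuristic (not tied to any particular tool) is to treat a file as fully copied once its size stops changing between two polls. A minimal Scala sketch of that idea; the directory path and the `.csv` filter are placeholder assumptions:

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object CopyCompleteCheck {
  // Placeholder: directory the upstream servers copy into.
  val incoming = Paths.get("/data/incoming")

  // Returns files whose size is unchanged since the previous poll,
  // plus the current size snapshot to pass into the next poll.
  def stableFiles(lastSizes: Map[Path, Long]): (Seq[Path], Map[Path, Long]) = {
    val current = Files.list(incoming).iterator().asScala
      .filter(_.toString.endsWith(".csv"))
      .map(p => p -> Files.size(p))
      .toMap
    val stable = current.collect {
      case (p, size) if lastSizes.get(p).contains(size) => p
    }.toSeq
    (stable, current)
  }
}
```

A stable file could then be claimed atomically (e.g., renamed or moved into a staging directory) so that two workers never stream the same file, which also covers point 3.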

My question is: is there an open-source ETL tool that provides all five of these and works well with Hadoop/Spark Streaming? I assume this process is fairly standard, but I haven't been able to find one yet.

Thank you.

Flume or Kafka will serve your purpose. Both are well integrated with Spark and Hadoop.
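If you go the Kafka route, the ingest side publishes CSV lines to a topic and a Spark Streaming job consumes, transforms, and writes them to HDFS. A minimal sketch using the spark-streaming-kafka-0-10 integration; the broker address, topic name (`csv-lines`), output path, and the masking transform are all placeholder assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CsvKafkaToHdfs {
  // Placeholder transform: mask the third column of each CSV line.
  def transform(line: String): String = {
    val cols = line.split(",", -1)
    if (cols.length > 2) cols(2) = "***"
    cols.mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("csv-ingest"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",          // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "csv-ingest",
      "auto.offset.reset" -> "earliest"
    )

    // One CSV line per Kafka record, published by whatever tails the files.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("csv-lines"), kafkaParams))

    stream.map(r => transform(r.value()))
      .saveAsTextFiles("hdfs:///data/ingest/csv")     // placeholder HDFS path

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On the Flume side, a spooling-directory source plus an HDFS sink covers much of this out of the box (it renames fully ingested files, which addresses point 4), though it expects files to be complete and immutable before they land in the spool directory.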

Try taking a look at the great library https://github.com/twitter/scalding. Maybe it can point you in the right direction :)
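For reference, a Scalding job that reads a CSV, masks one field, and writes the result back out looks roughly like this; the field names and the masking logic are made up for illustration:

```scala
import com.twitter.scalding._

// Hypothetical schema: the real CSV columns are unknown here.
class MaskCsvJob(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('id, 'name, 'ssn))
    // Mapping a field onto itself replaces its value in place.
    .map('ssn -> 'ssn) { s: String => s.replaceAll("[0-9]", "*") }
    .write(Csv(args("output")))
}
```

Note that Scalding runs as batch MapReduce jobs, so it handles the transform step but not the file-watching and bookkeeping in points 1-5.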
