
File processing using AWS EMR

I need an architectural suggestion for a problem I'm working on. Log files arrive every 15 minutes in a gzipped folder, and each of these folders contains about 100,000 further files to process. I have Python code that does the same processing on each of those files. There is no map/reduce code; we are just rearranging the data in that folder.

I want to use the parallel processing power of Hadoop to process these files faster. So my question is: do I always have to write map/reduce code to use Hadoop's parallelism, or is there a way to run my current Python code as-is on an EMR cluster and process these files in parallel?

Thank you for your help, Amey

Can I run my current Python code?

Maybe.

Check out Hadoop Streaming.

http://hadoop.apache.org/docs/r2.5.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html

You can do "map only" jobs using Hadoop Streaming. Add the following to your hadoop command that starts the job:

 -D mapred.reduce.tasks=0
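
A full invocation might look something like the sketch below. The streaming jar location varies by EMR/Hadoop version, the S3 paths are placeholders, and process_file.py stands in for your existing script; on newer Hadoop versions the property is spelled mapreduce.job.reduces.

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input s3://your-bucket/logs/2014-10-01/ \
        -output s3://your-bucket/processed/2014-10-01/ \
        -mapper process_file.py \
        -file process_file.py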

Do I always have to use MapReduce?

No.

MapReduce is one framework that runs on top of Hadoop. Even if you use it, you can configure jobs without a reducer. This will basically run the map code on each of your inputs and output whatever the map tasks output.
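As a rough sketch of what the mapper side could look like, assuming each line on stdin is a record (or a pointer to a file) that your existing Python logic knows how to handle, a streaming mapper just reads stdin and writes stdout; process_record is a hypothetical stand-in for your current code:

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper sketch for a map-only job.
    # Assumption: each stdin line is one record to rearrange;
    # process_record() is a placeholder for your existing logic.
    import sys

    def process_record(line):
        # Placeholder: replace with your current per-record processing.
        return line.strip()

    def main():
        for line in sys.stdin:
            result = process_record(line)
            if result:
                # With zero reducers, whatever the mapper prints
                # is written out as the job's output.
                sys.stdout.write(result + "\n")

    if __name__ == "__main__":
        main()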

You can also write applications natively on top of YARN. I don't have much experience with this, so I'll refer you to the docs. It looks like a pretty heavy, Java-centric process.

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
