I need an architectural suggestion for a problem I'm working on. Log files arrive every 15 minutes as a gzipped folder, and each folder contains about 100,000 files to process. I have Python code that applies the same processing to each of those files. There is no map/reduce logic; we are just rearranging the data in that folder.
I want to use Hadoop's parallel processing power to process these files faster. So my question is: do I always have to write map/reduce code to use Hadoop's parallelism, or is there a way to run my current Python code as-is on an EMR instance and process these files in parallel?
Thank you for your help, Amey
Maybe.
Check out Hadoop Streaming.
You can do "map only" jobs using Hadoop Streaming. Add the following to your hadoop command that starts the job:
-D mapred.reduce.tasks=0
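For example, a full invocation might look like the following. This is only a sketch: the streaming jar path varies by distribution (on EMR it typically lives under the Hadoop install directory), and the HDFS paths and process.py script are placeholders you'd adapt to your setup.

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /logs/incoming \
        -output /logs/processed \
        -mapper process.py \
        -file process.py

The -file option ships your script to each node, and with zero reduce tasks the mappers' output is written directly to the output directory.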
No.
MapReduce is one framework that runs on top of Hadoop. Even if you use it, you can configure jobs without a reducer. That basically runs your map code on each of your inputs and writes out whatever the map tasks emit.
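With Hadoop Streaming, a mapper is just a script that reads records from stdin and writes results to stdout, so your existing per-file logic can be wrapped with very little code. A minimal sketch, assuming each input line is the path of a file to process (process_file is a hypothetical stand-in for your current Python code):

    #!/usr/bin/env python
    import sys

    def process_file(path):
        # Hypothetical stand-in for your existing per-file processing logic.
        return path

    # Hadoop Streaming feeds input records to the mapper on stdin, one
    # line at a time; anything printed to stdout becomes the job output.
    for line in sys.stdin:
        path = line.strip()
        if path:
            print(process_file(path))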
You can also write applications natively on top of YARN. I don't have much experience with this, so I'll refer you to the docs. It looks like a pretty heavy, Java-centric process.
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html