File processing using AWS EMR

Question

I need an architectural suggestion for a problem I'm working on. Log files arrive every 15 minutes in a gzipped folder. Each folder contains about 100,000 files to process. I have Python code that does the same processing on each of those files. There is no map/reduce code; we are just rearranging the data in that folder.

I want to use the parallel processing power of Hadoop to process these files faster. So my question is: do I always have to write map/reduce code to use Hadoop's parallel processing, or is there a way to run my current Python code as is on the EMR instance and process these files in parallel?

Thank you for your help, Amey

Answer 1

Can I run my current Python code?

Maybe.

Check out Hadoop Streaming.

http://hadoop.apache.org/docs/r2.5.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html

You can run "map only" jobs using Hadoop Streaming. Add the following to the hadoop command that starts the job:

 -D mapred.reduce.tasks=0
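
To show where that option sits in the full command, here is a minimal sketch that assembles a map-only streaming invocation. The jar location, the S3 paths, and the script name `mapper.py` are placeholders I made up for illustration; substitute whatever your EMR cluster actually uses. The `-D` and `-files` generic options go before the streaming options (`-input`, `-output`, `-mapper`).

```python
import os
import shlex


def build_streaming_command(streaming_jar, input_path, output_path, mapper_script):
    """Return the argument list for a map-only Hadoop Streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options (-D, -files) must precede the streaming options.
        "-D", "mapred.reduce.tasks=0",   # zero reducers: map-only job
        "-files", mapper_script,          # ship the script to the task nodes
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python " + os.path.basename(mapper_script),
    ]


# All paths below are hypothetical placeholders, not real locations.
cmd = build_streaming_command(
    streaming_jar="/path/to/hadoop-streaming.jar",
    input_path="s3://my-bucket/logs/input/",
    output_path="s3://my-bucket/logs/output/",
    mapper_script="mapper.py",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed line can be pasted into a shell on the master node, or the list can be passed to `subprocess.run` directly.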

Do I always have to use MapReduce?

No.

MapReduce is one framework that runs on top of Hadoop. Even if you use it, you can configure jobs without a reducer. This basically runs the map code on each of your inputs and outputs whatever the map tasks output.
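
As a rough illustration of what the "map code" could look like for Hadoop Streaming, here is a minimal mapper sketch. Streaming feeds input records to the script on stdin and collects whatever it writes to stdout. The function `process_record` is a hypothetical stand-in for your existing Python logic; the field reversal inside it is only an example of "rearranging data", not your actual processing.

```python
#!/usr/bin/env python
"""Sketch of a mapper for a map-only Hadoop Streaming job."""
import sys


def process_record(line):
    """Hypothetical placeholder for the existing per-record logic."""
    # Example rearrangement only: reverse the tab-separated fields.
    fields = line.split("\t")
    return "\t".join(reversed(fields))


def run_mapper(stdin, stdout):
    """Read records line by line and write one processed record per line."""
    for raw in stdin:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        stdout.write(process_record(line) + "\n")


if __name__ == "__main__":
    run_mapper(sys.stdin, sys.stdout)
```

With no reducer configured, each map task's stdout is written straight to the job's output directory. You can try the script locally before submitting it, for example with `cat sample.txt | python mapper.py`.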

You can also write applications natively on top of YARN. I don't have much experience with this, so I'll refer you to the docs. It looks like a pretty heavy, Java-centric process.

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Answer 1
0 2014-09-16 16:35:51
