hadoop-streaming: automate post-processing once job is completed?
Step 1: I have a Hadoop streaming job whose runtime varies with the amount of data to process.
Step 2: Once the job is done, I need to import the resulting data dump into MongoDB and generate a flat CSV file from it.
Question
Is there any way I can glue Step 2 to Step 1 using Hadoop streaming, so I don't have to run Step 2 manually?
I would recommend using something like https://github.com/Yelp/mrjob or https://github.com/klbostee/dumbo. Specifically for mrjob and your problem, see http://packages.python.org/mrjob/job.html#writing-multi-step-jobs
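If you'd rather not adopt a framework, a minimal alternative is a small driver script that launches the streaming job and, only if it exits cleanly, runs the MongoDB import and CSV export. This is a sketch, not a drop-in solution: the jar path, HDFS directories, database, collection, and field names below are placeholder assumptions you would adapt to your own setup.

```python
import subprocess

def run_pipeline(commands):
    """Run shell commands in order; stop and return False on the first failure."""
    for cmd in commands:
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            print(f"step failed (exit {result.returncode}): {cmd}")
            return False
    return True

if __name__ == "__main__":
    # All paths/names here are placeholders -- adjust for your cluster.
    pipeline = [
        # Step 1: the streaming job itself
        "hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py "
        "-input /data/in -output /data/out",
        # Pull the job output down to the local filesystem
        "hadoop fs -getmerge /data/out dump.tsv",
        # Step 2a: load the dump into MongoDB
        "mongoimport --db mydb --collection results --type tsv "
        "--headerline --file dump.tsv",
        # Step 2b: export a flat CSV from the collection
        "mongoexport --db mydb --collection results --type=csv "
        "--fields field1,field2 --out results.csv",
    ]
    # Kick off the whole chain once:
    # run_pipeline(pipeline)
```

Because each command only runs after the previous one succeeds, Step 2 is automatically skipped if the Hadoop job fails, which is the main thing you'd otherwise have to check by hand.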