
hadoop - Map reduce on multiple clusters

I have configured a Hadoop cluster, and I have two machines, MA and MB. When I run the MapReduce program using the following command:

 hadoop jar /HDP/hadoop-1.2.0.1.3.0.0-0380/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-0380.jar -mapper "python C:\Python33\mapper.py" -reducer "python C:\Python33\redu.py" -input "/user/XXXX/input/input.txt" -output "/user/XXXX/output/out20131112_09"

where the mapper (C:\Python33\mapper.py) and the reducer (C:\Python33\redu.py) are on MB's local disk.

UPDATE

Finally, I have tracked down the error.

MA - error log:

stderr logs
python: can't open file 'C:\Python33\mapper.py': [Errno 2] No such file or directory
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2

The mapper (C:\Python33\mapper.py) and the reducer (C:\Python33\redu.py) are on MB's local disk; they do not exist on MA.

Now, do I need to copy my M/R program to MA, or how else can I resolve this?
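
Streaming launches the mapper and reducer commands on whichever node runs each task, so the scripts must be readable at the given path on every task node. You can either copy the scripts to the same path on MA, or have Hadoop ship them with the job using the streaming -file option, in which case the scripts land in each task's working directory and the commands can drop the absolute paths. A sketch of the second approach, reusing the jar and paths from the command above (it still assumes python is on the PATH of both machines):

 hadoop jar /HDP/hadoop-1.2.0.1.3.0.0-0380/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-0380.jar -file "C:\Python33\mapper.py" -file "C:\Python33\redu.py" -mapper "python mapper.py" -reducer "python redu.py" -input "/user/XXXX/input/input.txt" -output "/user/XXXX/output/out20131112_09"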

Mapper

import sys

# Word-count mapper: emit "word<TAB>1" for every whitespace-separated token.
for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        # No spaces around the tab: streaming splits key from value on the first tab.
        print('%s\t%d' % (key, value))
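
The post never shows redu.py. As a minimal sketch (assumed, not taken from the post), a streaming reducer pairing with this mapper can sum the counts of consecutive identical keys, since streaming hands the reducer its input sorted by key:

import sys

# Word-count reducer sketch (assumption; redu.py is not shown in the post).
# Streaming delivers input sorted by key, so counts for one word arrive together.
current_word = None
current_count = 0
for line in sys.stdin:
    word, _, count = line.strip().partition('\t')
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))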

If the map input file is smaller than dfs.block.size, you will end up with only one map task per job. For small inputs you can force Hadoop to run multiple map tasks by setting mapred.max.split.size (in bytes) to a value smaller than dfs.block.size.
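
For example, the split size can be passed as a generic -D option on the streaming command (generic options must come before the streaming-specific ones); the 1 MB value below is only illustrative:

 hadoop jar /HDP/hadoop-1.2.0.1.3.0.0-0380/contrib/streaming/hadoop-streaming-1.2.0.1.3.0.0-0380.jar -D mapred.max.split.size=1048576 -file "C:\Python33\mapper.py" -file "C:\Python33\redu.py" -mapper "python mapper.py" -reducer "python redu.py" -input "/user/XXXX/input/input.txt" -output "/user/XXXX/output/out20131112_09"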
