Using multiple mapper inputs in one streaming job on hadoop?

In Java I would use:

MultipleInputs.addInputPath(conf, path, inputFormatClass, mapperClass)

to add multiple inputs with a different mapper for each.

Now I am using Python to write a streaming job in Hadoop; can a similar job be done?

You can use multiple -input options to specify multiple input paths:

hadoop jar hadoop-streaming.jar -input foo.txt -input bar.txt ...
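
Note that -input by itself only adds paths; it does not attach a different mapper to each one. In a streaming job, one common workaround is to use a single mapper that branches on the name of the file the current split comes from, which Hadoop exposes to streaming tasks as the mapreduce_map_input_file environment variable (map_input_file on older releases). Below is a minimal sketch, assuming the two inputs are named foo.txt and bar.txt and that tagging each record with its source is enough for the reducer to tell them apart:

#!/usr/bin/env python
# mapper.py - one streaming mapper that branches on the input file of the
# current split, so records from different inputs can be handled differently.
import os
import sys

# Hadoop streaming exports job configuration to the task environment with
# dots replaced by underscores; older releases use map_input_file instead
# of mapreduce_map_input_file.
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))

for line in sys.stdin:
    line = line.rstrip("\n")
    if "foo.txt" in input_file:
        # records from the first input (file name is an assumption)
        print("foo\t%s" % line)
    elif "bar.txt" in input_file:
        # records from the second input (file name is an assumption)
        print("bar\t%s" % line)
    else:
        # any other input path falls through unchanged
        print(line)

You would then run it with both inputs and the same script, e.g. hadoop jar hadoop-streaming.jar -input foo.txt -input bar.txt -mapper mapper.py -file mapper.py -reducer reducer.py -output out.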

I suppose this can help you: https://github.com/hyonaldo/hadoop-multiple-streaming.

Here you can see "different mappers for these different input path" as well:

hadoop jar hadoop-multiple-streaming.jar \
  -input    myInputDirs \
  -multiple "outputDir1|mypackage.Mapper1|mypackage.Reducer1" \
  -multiple "outputDir2|mapper2.sh|reducer2.sh" \
  -multiple "outputDir3|mapper3.py|reducer3.py" \
  -multiple "outputDir4|/bin/cat|/bin/wc" \
  -libjars  "libDir/mypackage.jar" \
  -file     "libDir/mapper2.sh" \
  -file     "libDir/mapper3.py" \
  -file     "libDir/reducer2.sh" \
  -file     "libDir/reducer3.py"
