How do I disable output in Hadoop Streaming?

Question

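I am writing a Python MapReduce program on a cluster. My mapper parses the data and stores it in HBase. There is no reducer and no output.

The code is below for reference, in case it is needed.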

class Mapper:
  ...
  def __init__(...):
     ...

  def start(self, file):
    generator = self.read_input(file)
    connection = happybase.Connection(Mapper.IP)
    self.table = connection.table(Mapper.table_name)
    for line in generator:
      self.parse(line)
      self.write()
      self.buffers = []
    self.table = None
    connection.close()

  def read_input(self, file):
    ...
  def parse(self, line):
    ...
  def write(self):
    # write buffers into HBase
    for cell in self.buffers:
      self.table.put(cell[0], cell[1])     # <- Into HBase yay

My problem is that if I run this command on the cluster:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
-D mapred.reduce.tasks=1 \
-file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
-input /user/hduser/streamingTest/testFile.csv

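it fails with: ERROR streaming.StreamJob: Missing required option: output

Can I redirect the output to stdout, or disable it entirely?

PS: I am a poor Python programmer, so please point out any code that makes you uncomfortable.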

Answer 1

You are going to need to produce some output. If you want to output nothing, use

NullOutputFormat

as follows:

-outputformat org.apache.hadoop.mapred.lib.NullOutputFormat
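
For reference, here is a minimal sketch of how the option could be combined with the command from the question, written in Python so the argument list is easy to inspect. The helper name `build_streaming_command` is hypothetical, and setting `mapred.reduce.tasks=0` is my assumption based on the question saying there is no reducer (the original command used 1). Depending on the Hadoop version, the streaming jar may still insist on an `-output` path even with `NullOutputFormat`; I have not verified that for any particular release.

```python
# Old-API (org.apache.hadoop.mapred) output format that discards all records.
NULL_OUTPUT_FORMAT = "org.apache.hadoop.mapred.lib.NullOutputFormat"


def build_streaming_command(streaming_jar, mapper_path, input_path, reduce_tasks=0):
    """Return the argument list for a streaming job that writes no output.

    Hypothetical helper: it only assembles arguments and does not run Hadoop.
    """
    return [
        "bin/hadoop", "jar", streaming_jar,
        # Generic -D options go before the streaming-specific options.
        "-D", "mapred.reduce.tasks={}".format(reduce_tasks),
        "-file", mapper_path,
        "-mapper", mapper_path,
        "-input", input_path,
        # Discard whatever the job would have written.
        "-outputformat", NULL_OUTPUT_FORMAT,
    ]


if __name__ == "__main__":
    # Paths are the ones used in the question.
    argv = build_streaming_command(
        "contrib/streaming/hadoop-*streaming*.jar",
        "/home/hduser/mapper.py",
        "/user/hduser/streamingTest/testFile.csv",
    )
    # Print the command so it can be pasted into a shell (the glob is left unquoted).
    print(" ".join(argv))
```

Printing the list rather than executing it keeps the sketch safe to run anywhere; passing it to `subprocess` would need the jar glob expanded first.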

Solution 1
1 Accepted 2015-03-29 22:47:17