Finding the smallest number with Hadoop Streaming and Python

Question

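I am new to the Hadoop framework and the MapReduce abstraction.

Basically, I want to find the smallest number in a huge text file (the numbers are separated by ",").

So, here is my code, mapper.py: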

    #!/usr/bin/env python

    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        numbers = line.split(",")
        # increase counters
        for number in numbers:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print '%s\t%s' % (number, 1)

Reducer:

    #!/usr/bin/env python

    from operator import itemgetter
    import sys
    smallest_number = sys.float_info.max
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        # parse the input we got from mapper.py
        number, count = line.split('\t', 1)
        try:
            number = float(number)
        except ValueError:
            continue

        if number < smallest_number:
            smallest_number = number
            print smallest_number  # <---- I think the error is here... there is no key/value pair

        print smallest_number

The error I get:

    12/10/04 12:07:22 ERROR streaming.StreamJob: Job not successful. Error: NA
    12/10/04 12:07:22 INFO streaming.StreamJob: killJob...
    Streaming Command Failed!

Answer 1

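First, I hope you noticed that your solution will not work correctly unless you use only one reducer. Indeed, if you use multiple reducers, each one will emit the smallest number it received, and you will end up with more than one number. But then the next question is: if I have to use only one reducer (that is, a single task) for this problem, what do I gain from using MapReduce? The trick is that the mappers run in parallel. On the other hand, you do not want the mappers to output every number they read; otherwise the single reducer would have to go through the entire data set, which is no improvement over a sequential solution. The way around this is to have each mapper output only the smallest number it read. In addition, since you want all mapper output to go to the same reducer, the output key must be the same on every mapper.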
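To see why a shared key matters, here is a minimal sketch of how a hash partitioner routes records to reducers. Hadoop's default partitioner hashes the key and takes the result modulo the number of reduce tasks. The `partition` helper below is hypothetical and uses `zlib.crc32` as a stand-in for the Java hash, so the partition numbers will not match what Hadoop would pick, but the routing property is the same.

```python
import zlib


def partition(key, num_reducers):
    """Hypothetical stand-in for a hash partitioner: hash(key) mod reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Three mappers each emit their local minimum under the same constant key "0".
mapper_outputs = [("0", 4.5), ("0", -2.0), ("0", 17.0)]
num_reducers = 4

# Group values by the reducer they would be routed to.
by_reducer = {}
for key, value in mapper_outputs:
    by_reducer.setdefault(partition(key, num_reducers), []).append(value)

# Every record shares one key, so exactly one reducer receives all of them
# and can compute the global minimum.
for reducer_id, values in by_reducer.items():
    print("reducer %d sees %s -> min %s" % (reducer_id, values, min(values)))
```

With distinct keys the records could be spread over several reducers, and each would report its own minimum.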

The mapper would look like this:

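#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue  # skip blank lines; float('') would raise ValueError
    # split the line into comma-separated numbers
    numbers = line.split(",")
    s = min(float(x) for x in numbers)
    if smallest is None or s < smallest:
        smallest = s

# emit one record for the whole input split; the constant key 0 sends
# every mapper's minimum to the same reducer
if smallest is not None:
    print('%d\t%f' % (0, smallest))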

Reducer:

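#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    # each record is "key<TAB>value"; the value is one mapper's minimum
    s = float(line.split('\t')[1])
    if smallest is None or s < smallest:
        smallest = s

# print nothing if no records arrived
if smallest is not None:
    print(smallest)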
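The logic of the two scripts can be checked without a cluster. The following is only a sketch: `map_min` and `reduce_min` are hypothetical in-process mirrors of the mapper and reducer above (with blank lines skipped as an added assumption), and the two input splits are made-up data.

```python
def map_min(lines):
    """Mirror of the mapper: yield one 'key<TAB>min' record per input split."""
    smallest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        s = min(float(x) for x in line.split(","))
        if smallest is None or s < smallest:
            smallest = s
    if smallest is not None:
        yield "%d\t%f" % (0, smallest)


def reduce_min(records):
    """Mirror of the reducer: return the smallest value over all records."""
    smallest = None
    for record in records:
        s = float(record.strip().split("\t")[1])
        if smallest is None or s < smallest:
            smallest = s
    return smallest


# Two hypothetical input splits, each handled by its own mapper.
split_a = ["7,3,9\n", "12,5\n"]
split_b = ["4,-1.5,8\n"]

# The shuffle is trivial here: every record carries the same key.
shuffled = list(map_min(split_a)) + list(map_min(split_b))
result = reduce_min(shuffled)
print(result)
```

Each simulated mapper contributes a single record, so the reducer only has to compare one value per split instead of every number in the file.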

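There are other possible ways to solve this problem, for example using the MapReduce framework itself to sort the numbers so that the first number the reducer receives is the smallest. If you want to learn more about the MapReduce programming paradigm, you can read this tutorial with examples on my blog.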
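The sort-based alternative can be sketched in-process as follows. This is an illustration under assumptions, not a ready-made streaming job: `map_numbers` and `reduce_first` are hypothetical helpers, and `sorted()` stands in for the framework's shuffle-and-sort phase, assuming it is set up to compare keys numerically. Hadoop Streaming treats keys as text, so a real job would need its key comparison configured accordingly; that configuration is not shown here.

```python
def map_numbers(lines):
    """Emit every number as its own key (the value is unused)."""
    for line in lines:
        for token in line.strip().split(","):
            try:
                yield float(token)
            except ValueError:
                continue  # skip empty or non-numeric tokens


def reduce_first(sorted_keys):
    """With keys in ascending numeric order, the first key is the minimum."""
    for key in sorted_keys:
        return key
    return None  # no input at all


lines = ["10,9,100\n", "2,33,-4\n"]
# sorted() stands in for the framework's shuffle-and-sort phase, assuming
# it orders keys numerically rather than as text.
shuffled = sorted(map_numbers(lines))
smallest = reduce_first(shuffled)
print(smallest)
```

Sorting the same values as strings would place `"10"` before `"2"`, which is why the numeric ordering assumption matters. This variant also shuffles every number to the reducer, so it moves far more data than emitting one minimum per mapper.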

Solution 1
0 2013-08-18 02:22:47
