Python Map Reduce Mr job

I am new to Python programming, so apologies in advance if this is something easily solved. I want to use MapReduce to process a CSV file that contains some values, and the output must be the maximum value. This is the script I've written so far:

from mrjob.job import MRJob


class MRWordCounter(MRJob):

    def mapper(self, key, line):
        # Emit every number on the line under the same constant key
        for word in line.split(','):
            yield 'MAXIMUM VALUE IN FILE:', int(word)

    def reducer(self, word, occurrences):
        # All values arrive under that single key, so max() sees every number
        yield word, max(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()

The script works fine: it maps and reduces to the maximum value and prints it as output. However, I think the way I implement it, with yield 'MAXIMUM VALUE IN FILE:', is incorrect, since the mapper always sends that same string to the reducer. Can someone confirm whether this is the wrong way to implement it and recommend how I can fix it?

Your approach is correct. As you mentioned, the mapper always sends MAXIMUM VALUE IN FILE: as the only key to the reducer, which means the key carries no information at this stage of the job. Remember that the mapper only performs intermediate steps toward the final goal. Don't take this as a standard, but in my opinion, in terms of readability, the values the mapper emits are not the maximum value in the file, so they should not be labeled with the key MAXIMUM VALUE IN FILE:. Only the reducer knows which number is the maximum, so the reducer should be the one to label the final result.

In that case you can just send None as the key from the mapper, and then add to the reducer's output whatever label best describes the final result, in this case the maximum number.
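To see why the value of the key does not matter here, the map, group, and reduce steps can be mimicked in plain Python. This is only a rough sketch of the idea, not how mrjob works internally; the names shuffle and find_max and the sample lines are made up for illustration.

```python
from collections import defaultdict


def mapper(line, key=None):
    """Emit one (key, number) pair per comma-separated field."""
    for field in line.split(','):
        yield key, int(field)


def shuffle(pairs):
    """Group values by key, roughly what happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(numbers):
    """Label the result here, where the maximum is actually known."""
    return "Max value is", max(numbers)


def find_max(lines, key=None):
    """Run the three steps over an iterable of CSV lines."""
    pairs = (pair for line in lines for pair in mapper(line, key))
    groups = shuffle(pairs)
    return [reducer(numbers) for numbers in groups.values()]


lines = ["3,17,4", "42,8", "15"]  # hypothetical file contents
print(find_max(lines))                      # [('Max value is', 42)]
print(find_max(lines, key="any constant"))  # [('Max value is', 42)]
```

Whatever constant the mapper uses as the key, every number lands in the same group, so the reducer is called once and sees all of them.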

I would suggest the following version of your job instead. (I took the liberty of changing some variable names to clarify what the code does.)

from mrjob.job import MRJob


class MRFindMax(MRJob):

  def mapper(self, _, line):
    for number in line.split(','):
      yield None, int(number)

  # Discard key, because it is None
  # After sort and group, the output is only one key-value pair (None, <all_numbers>)
  def reducer(self, _, numbers):
    yield "Max value is", max(numbers)


if __name__ == '__main__':
  MRFindMax.run()
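One thing both versions assume is that every field is a valid integer. int('') raises ValueError, so a blank line or a trailing comma in the file would stop the job. Below is a minimal sketch, assuming you would rather skip empty fields than fail; parse_numbers is a hypothetical helper of my own, not part of mrjob.

```python
def parse_numbers(line):
    """Yield the integers on one CSV line, skipping empty fields."""
    for field in line.split(','):
        field = field.strip()
        if field:  # skip blanks left by trailing commas or empty lines
            yield int(field)


print(list(parse_numbers("3, 17,,4,")))  # [3, 17, 4]
print(list(parse_numbers("")))           # []
```

Inside the mapper you could then loop over parse_numbers(line) and yield None, number for each value.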

I hope you find this answer useful for writing code that is not only correct, as yours is, but that you also feel more comfortable with.
