
Yield both max and min in a single mapreduce

I am a beginner just getting started with writing MapReduce programs in Python using the MRJob library.

One of the examples worked through in the video tutorial finds the max temperature by location_id. Following on from that, writing another program to find the min temperature by location_id is straightforward too.

I am wondering, is there a way to yield both the max and min temperature by location_id in a single mapreduce program? Below is my go at it:

from mrjob.job import MRJob

'''Sample Data
ITE00100554,18000101,TMAX,-75,,,E,
ITE00100554,18000101,TMIN,-148,,,E,
GM000010962,18000101,PRCP,0,,,E,
EZE00100082,18000101,TMAX,-86,,,E,
EZE00100082,18000101,TMIN,-135,,,E,
ITE00100554,18000102,TMAX,-60,,I,E,
ITE00100554,18000102,TMIN,-125,,,E,
GM000010962,18000102,PRCP,0,,,E,
EZE00100082,18000102,TMAX,-44,,,E, 

Output I am expecting to see:
ITE00100554  32.3  20.2
EZE00100082  34.4  19.6
'''

class MaxMinTemperature(MRJob):
    def mapper(self, _, line):
        location, datetime, measure, temperature, w, x, y, z = line.split(',')
        temperature = float(temperature)/10
        if measure == 'TMAX' or measure == 'TMIN':
            yield location, temperature

    def reducer(self, location, temperatures):
        yield location, max(temperatures), min(temperatures)


if __name__ == '__main__':
    MaxMinTemperature.run()

I get the following error:

File "MaxMinTemperature.py", line 12, in reducer
yield location, max(temperatures), min(temperatures)
ValueError: min() arg is an empty sequence

Is this possible?

Thank you for your assistance.

Shiv

There are two problems in your reducer:

  1. If you check the type of the temperatures argument, you will see that it is a generator. A generator can be traversed only once, so you cannot pass the same generator to both the min and max functions. The right solution is to traverse it manually, in a single pass. Converting it to a list also works on small inputs, but it may cause an out-of-memory error on large enough input, because a list holds all its elements in memory and a generator does not.

  2. Each result of the reducer must be a two-element tuple (key, value). So you need to combine your min and max temperatures into another tuple and yield that as the value.

Complete working solution:

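from mrjob.job import MRJob


class MaxMinTemperature(MRJob):
    def mapper(self, _, line):
        # Each record has 8 comma-separated fields; only the first four are used.
        location, datetime, measure, temperature, w, x, y, z = line.split(',')
        temperature = float(temperature)/10
        if measure in ('TMAX', 'TMIN'):
            yield location, temperature

    def reducer(self, location, temperatures):
        # temperatures is a generator, so read it exactly once.
        min_temp = next(temperatures)
        max_temp = min_temp
        for item in temperatures:
            min_temp = min(item, min_temp)
            max_temp = max(item, max_temp)
        # Yield a (key, value) pair; the value carries both results.
        yield location, (min_temp, max_temp)


if __name__ == '__main__':
    MaxMinTemperature.run()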
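
The single-pass reducer logic can be checked without mrjob or Hadoop. The following is a minimal sketch, not part of the original answer: parse_line, min_max_once and simulate are hypothetical helper names, and grouping with a dict only stands in for the shuffle step that the framework performs between mapper and reducer.

```python
from collections import defaultdict

# First five rows of the sample data from the question.
SAMPLE = """\
ITE00100554,18000101,TMAX,-75,,,E,
ITE00100554,18000101,TMIN,-148,,,E,
GM000010962,18000101,PRCP,0,,,E,
EZE00100082,18000101,TMAX,-86,,,E,
EZE00100082,18000101,TMIN,-135,,,E,
"""


def parse_line(line):
    """Hypothetical stand-in for the mapper: (location, temperature) or None."""
    fields = line.split(',')
    location, measure, raw_value = fields[0], fields[2], fields[3]
    if measure in ('TMAX', 'TMIN'):
        return location, float(raw_value) / 10
    return None


def min_max_once(values):
    """Same single-pass logic as the reducer: consume the iterator once."""
    values = iter(values)
    min_temp = max_temp = next(values)
    for item in values:
        min_temp = min(item, min_temp)
        max_temp = max(item, max_temp)
    return min_temp, max_temp


def simulate(text):
    """Group mapper output by key (standing in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        pair = parse_line(line)
        if pair is not None:
            grouped[pair[0]].append(pair[1])
    # Hand each group over as a generator, as the reducer would receive it.
    return {loc: min_max_once(t for t in temps) for loc, temps in grouped.items()}


result = simulate(SAMPLE)
print(result)
```

Because min_max_once receives a generator here, the check exercises the same one-shot constraint that caused the original error.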

The problem is that temperatures in your reducer method is a generator.

For better understanding, let's create a simple generator and look at its behavior:

def my_gen(an_iterable):
    for item in an_iterable:
        yield item

my_generator = my_gen([1,2,3,4,5])
print(type(my_generator)) # <class 'generator'>

One of the features of such an object is that once it is exhausted, you can't reuse it:

print(list(my_generator)) # [1, 2, 3, 4, 5]
print(list(my_generator)) # []

Therefore, calling max() and then min() on the same generator leads to an error:

my_generator = my_gen([1,2,3,4,5])

print(max(my_generator)) # 5
print(min(my_generator)) # ValueError: min() arg is an empty sequence

So, you can't use the same generator with both the max() and min() built-in functions, because by the second call the generator is already exhausted.
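
The difference between a generator and a list comes down to how each one produces iterators. The sketch below is an illustration added here, reusing the my_gen function from above; the variable names are only for demonstration.

```python
def my_gen(an_iterable):
    for item in an_iterable:
        yield item


gen = my_gen([1, 2, 3])
numbers = [1, 2, 3]

# A generator is its own iterator, so every consumer shares one position.
gen_is_own_iterator = iter(gen) is gen

# A list hands out a fresh iterator on each call, so it can be read again.
list_iters_differ = iter(numbers) is not iter(numbers)

first_pass = max(gen)        # consumes the whole generator
leftover = next(gen, None)   # nothing is left, so the default is returned

print(gen_is_own_iterator, list_iters_differ, first_pass, leftover)
```

This is why max(my_list) followed by min(my_list) works on a list, while the same two calls on a generator fail on the second one.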


Instead you can:

1) convert the generator to a list and work with that:

my_generator = my_gen([1,2,3,4,5])
my_list = list(my_generator)

print(max(my_list)) # 5
print(min(my_list)) # 1 

2) or extract the min and max values from the generator in a single pass:

from functools import reduce

my_generator = my_gen([1,2,3,4,5])

# The accumulator x is a (max_so_far, min_so_far) pair; y is the next item.
val_max, val_min = reduce(lambda x,y: (max(y, x[0]), min(y, x[1])), my_generator, (float('-inf'), float('inf')))

print(val_max, val_min) # 5 1
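
The same single pass can also be written as an explicit loop, which some readers find easier to follow than reduce. This is only a sketch: min_max is a hypothetical helper name, and returning None for empty input is an assumption made here, not behavior taken from the answer.

```python
def min_max(iterable):
    """Return (max, min) of iterable in one pass, or None if it is empty."""
    iterator = iter(iterable)
    try:
        # Seed both results with the first item instead of +/- infinity.
        val_max = val_min = next(iterator)
    except StopIteration:
        return None
    for value in iterator:
        if value > val_max:
            val_max = value
        elif value < val_min:
            val_min = value
    return val_max, val_min


print(min_max(x for x in [1, 2, 3, 4, 5]))
```

Seeding from the first item avoids the infinity sentinels, so the helper also works for values that do not compare with floats, such as strings.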

So, the following edit of the reducer:

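def reducer(self, location, temperatures):
    # Materialize the generator once so it can be scanned twice.
    tempr_list = list(temperatures)
    # mrjob expects a (key, value) pair, so both results go in one tuple.
    yield location, (max(tempr_list), min(tempr_list))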

should fix the error.
