Python Hadoop code for max/min temperature on a specific dataset

Question

I am trying to write a mapper/reducer program to calculate the max/min temperature from a data set. I tried to modify the code myself, but it doesn't work. Since I changed the mapper, the mapper runs fine but the reducer doesn't.

My sample code: mapper.py

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[14:18], val[25:30], val[31:32])
  if (temp != "9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)

reducer.py

import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
  print "%s\t%s" % (last_key, max_val)

Sample line from the file:

690190,13910, 2012**0101, * 42.9 ,18, 29.4,18, 1033.3,18, 968.7,18, 10.0,18, 8.7,18, 15.0, 999.9, 52.5 , 31.6*, 0.00I,999.9, 000000,

I need the values in bold. Any ideas?

This is my output if I run the mapper as a standalone script:

root@ubuntu:/home/hduser/files# python maxtemp-map.py
2012    42.9
2012    50.0
2012    47.0
2012    52.0
2012    43.4
2012    52.6
2012    51.1
2012    50.9
2012    57.8
2012    50.7
2012    44.6
2012    46.7
2012    52.1
2012    48.4
2012    47.1
2012    51.8
2012    50.6
2012    53.4
2012    62.9
2012    62.6

The file contains data for different years. I have to calculate the min, max, and average for each year.

FIELD   POSITION  TYPE   DESCRIPTION

STN---  1-6       Int.   Station number (WMO/DATSAV3 number)
                         for the location.

WBAN    8-12      Int.   WBAN number where applicable--this is the
                         historical 
YEAR    15-18     Int.   The year.

MODA    19-22     Int.   The month and day.

TEMP    25-30     Real   Mean temperature. Missing = 9999.9


Count   32-33     Int.   Number of observations in mean temperature

Answer 1

I am having trouble parsing your question, but I think it reduces to this:

You have a dataset, and each line of the dataset represents different quantities related to a single time point. You would like to extract the max/min of one of these quantities from the entire dataset.

If this is the case, I'd do something like this:

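temps = []
with open(file_name, 'r') as infile:  # file_name: path to your data file
    for line in infile:
        line = line.strip().split(',')
        year = int(line[2].strip()[:4])  # first four digits of the date field
        temp = float(line[3])            # temperatures are decimals, e.g. 42.9
        temps.append((temp, year))

# sorting the (temp, year) tuples puts the coldest first and the warmest last
temps = sorted(temps)
min_temp, min_year = temps[0]
max_temp, max_year = temps[-1]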

EDIT:

Farley, I think what you are doing with mapper/reducer may be overkill for what you want from your data. Here are some additional questions about your initial file structure.

What are the contents of each line (be specific) in the dataset? For example: date, time, temp, pressure, ...
Which piece of data from each line do you want to extract? Temperature? At what position in the line is that piece of data?
Does each file only contain data from one year?

For example, if your dataset looked like

year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...
year, month, day, temp, pressure, cloud_coverage, ...

then the simplest thing to do is to loop through each line and extract the relevant information. It appears you only want the year and the temperature. In this example, these are located at positions 0 and 3 in each line. Therefore, we will have a loop that looks like

from collections import defaultdict
data = defaultdict(list)

with open(file_name, 'r') as infile:
    for line in infile:
        line = line.strip().split(', ')
        year = line[0]
        temp = float(line[3])  # convert so numeric statistics work later
        data[year].append(temp)

See, we extracted the year and temp from each line in the file and stored them in a special dictionary object. If we printed it out, it would look like

year1: [temp1, temp2, temp3, temp4]
year2: [temp5, temp6, temp7, temp8]
year3: [temp9, temp10, temp11, temp12]
year4: [temp13, temp14, temp15, temp16]

Now, this makes it very convenient for us to do statistics on all the temperatures of a given year. For example, to compute the maximum, minimum, and average temperature, we could do

import numpy as np
for year in data:
    temps = np.array(data[year], dtype=float)
    output = (year, temps.mean(), temps.min(), temps.max())
    # unpack the tuple so each placeholder gets its own value
    print('Year: {0} Avg: {1} Min: {2} Max: {3}'.format(*output))
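Applied to the sample line in the question, the same grouping idea might look like the sketch below. It is not a definitive implementation: it assumes comma-separated fields with the date (YYYYMMDD) in the third column and the mean temperature in the fourth, and it treats 9999.9 as the missing-value marker named in the question's field table. The function name `yearly_stats` and the sample lines are made up for illustration, and the code uses only the standard library (Python 3).

```python
from collections import defaultdict

MISSING = 9999.9  # missing-value marker from the field table in the question


def yearly_stats(lines):
    """Return {year: (min, max, avg)} for comma-separated weather records."""
    temps_by_year = defaultdict(list)
    for line in lines:
        fields = line.strip().split(',')
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # Assumption: third column is YYYYMMDD, fourth is the mean temperature.
        # Stray '*' characters and spaces around a value are dropped defensively.
        year = fields[2].strip(' *')[:4]
        try:
            temp = float(fields[3].strip(' *'))
        except ValueError:
            continue  # skip header rows or unparsable values
        if temp == MISSING:
            continue
        temps_by_year[year].append(temp)

    return {
        year: (min(temps), max(temps), sum(temps) / len(temps))
        for year, temps in temps_by_year.items()
    }


# Illustrative lines modeled on the question's sample, not real data.
sample = [
    "690190,13910, 20120101, 42.9 ,18, 29.4,18",
    "690190,13910, 20120102, 50.0 ,18, 30.1,18",
    "690190,13910, 20130101, 9999.9 ,18, 30.1,18",
]
for year, (low, high, avg) in sorted(yearly_stats(sample).items()):
    print('Year: {0} Min: {1} Max: {2} Avg: {3:.2f}'.format(year, low, high, avg))
```

For a real file, something like `yearly_stats(open(file_name))` would work, since the function accepts any iterable of lines.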

I'm more than willing to help you sort out your problem, but I need you to be more specific about what exactly your data looks like, and what you want to extract.

Answer 2

If the mapper produces something like a store name and that store's sales as its intermediate result, you can use the following as a reducer to find the maximum sales and which store has them. Similarly, it finds the minimum sales and which store has them.

The following reducer example assumes that you have the sales total for each store as an input file.

#! /usr/bin/python

import sys

mydict = {}

salesTotal = 0
oldKey = None

for line in sys.stdin:
    data=line.strip().split("\t")

    if len(data)!=2:
        continue

    thisKey, thisSale = data

    if oldKey and oldKey != thisKey:
        mydict[oldKey] = float(salesTotal)
        salesTotal = 0

    oldKey = thisKey
    salesTotal += float(thisSale)

if oldKey!= None:
    mydict[oldKey] = float(salesTotal)

maximum = max(mydict, key=mydict.get)
print(maximum, mydict[maximum])

minimum = min(mydict, key=mydict.get)
print(minimum, mydict[minimum])
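Adapted to the temperature question, the same key-change pattern could be sketched as follows. This is only a sketch under stated assumptions: the mapper is assumed to emit `year<TAB>temp` lines as in the question's output, and the input is assumed to arrive grouped by year, which Hadoop Streaming's sort phase provides. The function name `reduce_temps` is hypothetical. Note that `int(val)` in the question's reducer raises `ValueError` on a value like `42.9`, so `float` is used here.

```python
def reduce_temps(lines):
    """Yield (year, min, max, avg) from 'year<TAB>temp' lines grouped by year."""
    current = None  # key of the group being accumulated
    low = high = total = 0.0
    count = 0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # ignore malformed lines
        year, raw = parts
        try:
            temp = float(raw)
        except ValueError:
            continue  # ignore non-numeric temperatures
        if year != current:
            # Key changed: emit the finished group, then start a new one.
            if current is not None:
                yield current, low, high, total / count
            current, low, high, total, count = year, temp, temp, 0.0, 0
        low = min(low, temp)
        high = max(high, temp)
        total += temp
        count += 1
    if current is not None:
        yield current, low, high, total / count  # emit the last group


# Demo on in-memory lines; in a streaming job, pass sys.stdin instead.
demo = ["2012\t42.9", "2012\t50.0", "2013\t31.6"]
for year, low, high, avg in reduce_temps(demo):
    print("%s\t%s\t%s\t%.2f" % (year, low, high, avg))
```

Keeping only a running min, max, sum, and count per key means the reducer never holds a whole year's values in memory, which is the main reason to prefer this over collecting lists when the data set is large.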

2 answers

Answer 1
0 2013-06-26 21:33:57

Answer 2
0 2016-02-16 13:19:05
