
Hadoop MapReduce | SMA in python

I'm relatively new to Python and MapReduce. I'm trying to calculate a Simple Moving Average (SMA) using the Ta-Lib library in Python. I have a data frame like this:

             AA     BB  
2008-03-05  36.60  36.60  
2008-03-04  38.37  38.37  
2008-03-03  38.71  38.71  
2008-03-02  38.00  38.00
2008-03-01  38.32  38.32
2008-02-29  37.14  37.14     

AA and BB are the stock symbols, and their values for 6 days are shown.
Can anyone help me out here? What should the map step do, and what input should the reduce step get?

The final output should print the SMAs for the stocks A and B.

What is an SMA (Simple Moving Average)? A simple, or arithmetic, moving average is calculated by adding the closing price of the security for a number of time periods and then dividing this total by the number of time periods.

For example, in the data above, the closing prices are: 37.14 (2008-02-29), 38.32 (2008-03-01), 38.00 (2008-03-02), 38.71 (2008-03-03), 38.37 (2008-03-04), 36.60 (2008-03-05).

So the 3-day SMA for 2008-03-02 is (37.14 + 38.32 + 38.00) / 3 = 37.82. There is no 3-day SMA for 2008-02-29 (because there is data for only 1 day: 2008-02-29) and no 3-day SMA for 2008-03-01 (there is data for only 2 days: 2008-02-29 and 2008-03-01).
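As a quick local sanity check (not part of the MapReduce job, and independent of Ta-Lib), the same 3-day SMAs can be reproduced with pandas; the column names and dates below are copied from the question:

import pandas as pd

# Closing prices in chronological order, as in the question's data frame
df = pd.DataFrame(
    {"AA": [37.14, 38.32, 38.00, 38.71, 38.37, 36.60],
     "BB": [37.14, 38.32, 38.00, 38.71, 38.37, 36.60]},
    index=["2008-02-29", "2008-03-01", "2008-03-02",
           "2008-03-03", "2008-03-04", "2008-03-05"],
)

# rolling(3).mean() leaves NaN for the first two rows, where no 3-day SMA exists
print(df.rolling(3).mean())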

Following is the solution for a 3-day SMA for your data (which you can easily change to an n-day SMA).

Mapper (m.py):

import sys

# Each input line is tab-separated: <date>\t<value A>\t<value B>
for line in sys.stdin:
    vals = line.strip().split('\t')
    # Emit: <date>\t<value A>:<value B>
    print("%s\t%s:%s" % (vals[0], vals[1], vals[2]))

Mapper Logic: It just reads the tab-separated values in the line and outputs "{key}\t{val1}:{val2}".

For example, for the first line (tab-separated values):

2008-03-05    36.60    36.60  

it outputs:

2008-03-05    36.60:36.60  

Reducer (r.py):

import sys

# Sliding windows of the last `smaInterval` values for stock A and stock B
lValueA = list()
lValueB = list()

smaInterval = 3

for line in sys.stdin:
    (key, val) = line.strip().split('\t')

    # The mapper's output value is "<value A>:<value B>"
    vals = val.split(':')
    lValueA.append(float(vals[0]))
    lValueB.append(float(vals[1]))

    # Once the window is full, emit the SMAs for this date and slide the window
    if len(lValueA) == smaInterval:

        sumA = 0
        sumB = 0

        for a in lValueA:
            sumA += a
        for b in lValueB:
            sumB += b

        sumA = sumA / smaInterval
        sumB = sumB / smaInterval

        print("%s\t%.2f\t%.2f" % (key, sumA, sumB))
        del lValueA[0]
        del lValueB[0]

Reducer Logic:

  • It uses 2 lists: one for Stock A and one for Stock B.
  • It assumes that the SMA interval is 3 (smaInterval = 3).
  • As and when a line of input comes in, it parses the line and appends value A and value B to their respective lists.
  • When the size of any list reaches 3 (which is the SMA interval), it computes the moving average, outputs (key, SMA for Stock A, SMA for Stock B), and removes the zeroth element from each of the lists (a standalone sketch of this sliding-window idea follows below).
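The same sliding window, stripped of the Hadoop streaming plumbing, can be sketched as a plain function; collections.deque with maxlen drops the oldest price automatically (the function name and example values here are illustrative, not part of the original code):

from collections import deque

def sma_windows(prices, interval=3):
    # Yield (position, average) once the window of `interval` prices is full
    window = deque(maxlen=interval)
    for i, price in enumerate(prices):
        window.append(price)
        if len(window) == interval:
            yield i, sum(window) / interval

# Prices from the question, oldest first
prices = [37.14, 38.32, 38.00, 38.71, 38.37, 36.60]
for i, avg in sma_windows(prices):
    print("position %d -> %.2f" % (i, avg))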

I executed this for your input.

I executed it without using Hadoop, as below (input.txt contains the input mentioned in the question, with tab-separated values):

cat input.txt | python m.py | sort | python r.py

I got the following output (which I verified to be correct):

2008-03-02      37.82   37.82
2008-03-03      38.34   38.34
2008-03-04      38.36   38.36
2008-03-05      37.89   37.89

You should be able to execute the same using the Hadoop framework, as:

hadoop jar hadoop-streaming-2.7.1.jar -input {Input directory in HDFS} -output {Output directory in HDFS} -mapper {Path to the m.py} -reducer {Path to the r.py}

Note: This code can be optimized, and you may not need the reducer at all. If your data is small, you can read all the values on the mapper side itself, sort them, and then compute the SMA. I just wrote this code to illustrate the computation of an SMA using Hadoop streaming.
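A minimal sketch of that mapper-only variant, assuming the whole input fits in memory (this is an illustration, not part of the original answer):

import sys

smaInterval = 3
rows = []

# Read every input line: <date>\t<value A>\t<value B>
for line in sys.stdin:
    date, a, b = line.strip().split('\t')
    rows.append((date, float(a), float(b)))

# Sort by date so the window moves chronologically
rows.sort(key=lambda r: r[0])

# Slide a window of smaInterval rows and emit (date, SMA A, SMA B)
for i in range(smaInterval - 1, len(rows)):
    window = rows[i - smaInterval + 1:i + 1]
    smaA = sum(r[1] for r in window) / smaInterval
    smaB = sum(r[2] for r in window) / smaInterval
    print("%s\t%.2f\t%.2f" % (rows[i][0], smaA, smaB))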
