
Hadoop MapReduce | SMA in python

I'm relatively new to Python and MapReduce. I'm trying to calculate a Simple Moving Average (SMA) using the TA-Lib library in Python. I have a data frame like this:

             AA     BB  
2008-03-05  36.60  36.60  
2008-03-04  38.37  38.37  
2008-03-03  38.71  38.71  
2008-03-02  38.00  38.00
2008-03-01  38.32  38.32
2008-02-29  37.14  37.14     

AA and BB are the stock symbols, and their values for 6 days are shown.
Can anyone help me out here? What should the mapper do, and what input should the reducer get?

The final output should print the SMAs for the stocks AA and BB.

What is an SMA (Simple Moving Average)? It is a simple, or arithmetic, moving average, calculated by adding the closing prices of the security over a number of time periods and then dividing this total by the number of time periods.

For example, in the data above, the closing prices are: 37.14 (2008-02-29), 38.32 (2008-03-01), 38.00 (2008-03-02), 38.71 (2008-03-03), 38.37 (2008-03-04), 36.60 (2008-03-05).

So the 3-day SMA for 2008-03-02 is (37.14 + 38.32 + 38.00) / 3 = 37.82. There is no 3-day SMA for 2008-02-29 (because there is data for only 1 day: 2008-02-29) and no 3-day SMA for 2008-03-01 (there is data for only 2 days: 2008-02-29 and 2008-03-01).
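To make the arithmetic above concrete, here is a minimal Python sketch that computes the 3-day SMA over the six closing prices, oldest first. The function name `simple_moving_average` is only illustrative and is not part of TA-Lib or Hadoop.

```python
def simple_moving_average(prices, window):
    """Return one entry per price: the SMA of the last `window` prices,
    or None where fewer than `window` prices are available yet."""
    result = []
    for i in range(len(prices)):
        if i + 1 < window:
            result.append(None)  # not enough history for this day
        else:
            chunk = prices[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


# Closing prices from 2008-02-29 to 2008-03-05, oldest first.
closes = [37.14, 38.32, 38.00, 38.71, 38.37, 36.60]
for value in simple_moving_average(closes, 3):
    print("n/a" if value is None else "%.2f" % value)
```

The first two days print "n/a" because they have fewer than three days of history; the third printed value is the 37.82 worked out above.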

The following is a solution for a 3-day SMA on your data (you can easily change it to an n-day SMA).

Mapper (m.py):

import sys

# Read tab-separated "date<TAB>AA<TAB>BB" lines and emit "date<TAB>AA:BB".
for line in sys.stdin:
    vals = line.strip().split('\t')
    print("%s\t%s:%s" % (vals[0], vals[1], vals[2]))

Mapper Logic: It just reads the tab-separated values in each line and outputs "{key}\t{val1}:{val2}".

For example, for the first line (tab-separated values):

2008-03-05    36.60    36.60  

it outputs:

2008-03-05    36.60:36.60  

Reducer (r.py):

import sys

# Sliding windows holding the most recent values for each stock.
lValueA = list()
lValueB = list()

smaInterval = 3

# Input lines arrive sorted by key (date), e.g. "2008-03-02\t38.00:38.00".
for line in sys.stdin:
    (key, val) = line.strip().split('\t')

    vals = val.split(':')
    lValueA.append(float(vals[0]))
    lValueB.append(float(vals[1]))
    if len(lValueA) == smaInterval:
        # The window is full: average it and emit the SMA for this date.
        smaA = sum(lValueA) / smaInterval
        smaB = sum(lValueB) / smaInterval

        print("%s\t%.2f\t%.2f" % (key, smaA, smaB))
        # Drop the oldest value so the window slides forward by one day.
        del lValueA[0]
        del lValueB[0]

Reducer Logic:

  • It uses 2 lists: one for stock AA and one for stock BB.
  • It assumes that the SMA interval is 3 ( smaInterval = 3 ).
  • As each line of input comes in, it parses the line and appends value A and value B to their respective lists.
  • When the size of the lists reaches 3 (the SMA interval), it computes the moving averages, outputs (key, SMA for stock AA, SMA for stock BB), and removes the zeroth element from each list.
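The sliding-window steps listed above can also be sketched with `collections.deque`, whose `maxlen` argument discards the oldest element automatically. This is an alternative illustration rather than the reducer itself; the function name `rolling_sma` and the tuple layout of `rows` are assumptions made for the example.

```python
from collections import deque


def rolling_sma(rows, window=3):
    """Yield (key, sma_a, sma_b) once `window` rows have been seen.

    `rows` is an iterable of (key, value_a, value_b) tuples that is
    already sorted by key, as the reducer's input would be.
    """
    window_a = deque(maxlen=window)  # oldest value is dropped automatically
    window_b = deque(maxlen=window)
    for key, value_a, value_b in rows:
        window_a.append(value_a)
        window_b.append(value_b)
        if len(window_a) == window:
            yield key, sum(window_a) / window, sum(window_b) / window


rows = [
    ("2008-02-29", 37.14, 37.14),
    ("2008-03-01", 38.32, 38.32),
    ("2008-03-02", 38.00, 38.00),
    ("2008-03-03", 38.71, 38.71),
]
for key, sma_a, sma_b in rolling_sma(rows):
    print("%s\t%.2f\t%.2f" % (key, sma_a, sma_b))
```

Because the deque caps its own length, there is no explicit step that deletes the zeroth element; appending the next value pushes the oldest one out.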

I ran this on your input without using Hadoop, as shown below (input.txt contains the input from your question as tab-separated values):

cat input.txt | python m.py | sort | python r.py

I got the following output (which I verified to be correct):

2008-03-02      37.82   37.82
2008-03-03      38.34   38.34
2008-03-04      38.36   38.36
2008-03-05      37.89   37.89

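You should be able to run the same scripts with Hadoop Streaming as follows (the reducer's sliding window assumes it sees every date in sorted order, so the job should use a single reducer):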

hadoop jar hadoop-streaming-2.7.1.jar -input {Input directory in HDFS} -output {Output directory in HDFS} -mapper {Path to the m.py} -reducer {Path to the r.py}

Note: This code can be optimized, and you may not need the reducer at all. If your data is small, you can read all the values on the mapper side, sort them, and then compute the SMA there. I wrote this code only to illustrate computing an SMA with Hadoop Streaming.
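As a rough sketch of the mapper-only idea in the note, the function below reads all lines, sorts them by date, and computes the SMA in one pass. It assumes the same tab-separated "date, AA, BB" layout as input.txt and no header row; the name `sma_lines` is hypothetical.

```python
def sma_lines(lines, window=3):
    """Parse 'date<TAB>AA<TAB>BB' lines, sort them by date, and return
    output lines of the form 'date<TAB>smaAA<TAB>smaBB'."""
    rows = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        rows.append((fields[0], float(fields[1]), float(fields[2])))
    rows.sort()  # ISO dates (YYYY-MM-DD) sort chronologically as strings

    output = []
    for i in range(window - 1, len(rows)):
        chunk = rows[i + 1 - window:i + 1]
        avg_a = sum(r[1] for r in chunk) / window
        avg_b = sum(r[2] for r in chunk) / window
        output.append("%s\t%.2f\t%.2f" % (rows[i][0], avg_a, avg_b))
    return output


sample_lines = [
    "2008-03-02\t38.00\t38.00\n",
    "2008-03-01\t38.32\t38.32\n",
    "2008-02-29\t37.14\t37.14\n",
]
for out in sma_lines(sample_lines):
    print(out)
```

In a streaming job the function would be given `sys.stdin` instead of the in-memory list. This only works when a single mapper sees the whole data set, which is why the note limits it to small inputs.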
