Hadoop MapReduce | SMA in Python
I'm relatively new to Python and MapReduce. I'm trying to calculate a Simple Moving Average (SMA) using the Ta-Lib library in Python. I have a data frame like this:
AA BB
2008-03-05 36.60 36.60
2008-03-04 38.37 38.37
2008-03-03 38.71 38.71
2008-03-02 38.00 38.00
2008-03-01 38.32 38.32
2008-02-29 37.14 37.14
AA and BB are the stock symbols, and their values for 6 days are shown.
Can anyone help me out here? What should the map step do, and what input should the reduce step get?
The final output should print the SMAs for the stocks AA and BB.
What is an SMA (Simple Moving Average)? It is a simple, or arithmetic, moving average that is calculated by adding the closing price of the security for a number of time periods and then dividing this total by the number of time periods.
For example, in the data above, the closing prices are: 37.14 (2008-02-29), 38.32 (2008-03-01), 38.00 (2008-03-02), 38.71 (2008-03-03), 38.37 (2008-03-04), 36.60 (2008-03-05).
So the 3-day SMA for 2008-03-02 is (37.14 + 38.32 + 38.00) / 3 = 37.82. There is no 3-day SMA for 2008-02-29 (only 1 day of data is available: 2008-02-29) and none for 2008-03-01 (only 2 days of data are available: 2008-02-29 and 2008-03-01).
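To make the definition concrete, here is a minimal sketch of an n-period SMA in plain Python. The function name simple_moving_average and the list closes are made up for this illustration; it does not use Ta-Lib or Hadoop, and it assumes the prices are already in chronological order.

```python
def simple_moving_average(prices, n):
    """Return the n-period SMA for every position that has n values behind it."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Position i needs prices[i-n+1 .. i], so the first SMA appears at index n-1.
    return [sum(prices[i - n + 1:i + 1]) / float(n)
            for i in range(n - 1, len(prices))]


# Closing prices from the example, oldest first (2008-02-29 .. 2008-03-05).
closes = [37.14, 38.32, 38.00, 38.71, 38.37, 36.60]
sma3 = simple_moving_average(closes, 3)
print(["%.2f" % v for v in sma3])
```

Six prices with a 3-day window give four SMA values, one for each date from 2008-03-02 onwards.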
The following solution computes a 3-day SMA for your data (you can easily change it to an n-day SMA).
Mapper (m.py):
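import sys

# Each input line holds three tab-separated fields: date, AA price, BB price.
for line in sys.stdin:
    val = line.strip()
    vals = val.split('\t')
    # Emit the date as the key and the two prices joined by ':' as the value.
    print("%s\t%s:%s" % (vals[0], vals[1], vals[2]))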
Mapper logic: it just reads the tab-separated values on each line and outputs "{key}\t{val1}:{val2}".
For example, for the first line (tab-separated values):
2008-03-05 36.60 36.60
it outputs:
2008-03-05 36.60:36.60
Reducer (r.py):
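import sys

lValueA = list()   # sliding window of AA prices
lValueB = list()   # sliding window of BB prices
smaInterval = 3    # window length in days

# The input must arrive sorted by date (the key) in ascending order.
for line in sys.stdin:
    (key, val) = line.strip().split('\t')
    vals = val.split(':')
    lValueA.append(float(vals[0]))
    lValueB.append(float(vals[1]))
    if len(lValueA) == smaInterval:
        # The window is full: average it and label the result with the latest date.
        smaA = sum(lValueA) / smaInterval
        smaB = sum(lValueB) / smaInterval
        print("%s\t%.2f\t%.2f" % (key, smaA, smaB))
        # Drop the oldest value so the window slides forward by one day.
        del lValueA[0]
        del lValueB[0]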
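Reducer logic: the reducer keeps two lists, one for the AA values and one for the BB values, and appends each incoming pair as it is read. Once the lists hold smaInterval values (smaInterval = 3 here), it averages each list, prints the two averages against the current key (the date), and deletes the oldest value from each list so that the window slides forward by one day. Because the input arrives sorted by date, each output line is the 3-day SMA ending on that date.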
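The same sliding window can also be sketched with collections.deque, which discards the oldest element automatically once maxlen is reached. This is only an alternative sketch, not the code from r.py: the function name rolling_sma and the list sample are assumptions for this illustration, and the input lines are assumed to be date-sorted mapper output of the form "date<TAB>A:B".

```python
from collections import deque


def rolling_sma(lines, interval=3):
    """Yield (date, sma_a, sma_b) for date-sorted lines shaped like 'date<TAB>A:B'."""
    window_a = deque(maxlen=interval)  # oldest value falls out automatically
    window_b = deque(maxlen=interval)
    for line in lines:
        key, val = line.strip().split('\t')
        a, b = val.split(':')
        window_a.append(float(a))
        window_b.append(float(b))
        # Only emit once a full window of `interval` values is available.
        if len(window_a) == interval:
            yield key, sum(window_a) / interval, sum(window_b) / interval


# Mapper output for the question's data, already sorted by date.
sample = [
    "2008-02-29\t37.14:37.14",
    "2008-03-01\t38.32:38.32",
    "2008-03-02\t38.00:38.00",
    "2008-03-03\t38.71:38.71",
    "2008-03-04\t38.37:38.37",
    "2008-03-05\t36.60:36.60",
]
for date, sma_a, sma_b in rolling_sma(sample):
    print("%s\t%.2f\t%.2f" % (date, sma_a, sma_b))
```

With a bounded deque there is no explicit delete step, so the window logic is harder to get wrong when the interval changes.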
I ran it on your input without using Hadoop, as shown below (input.txt contains the input from the question, as tab-separated values):
cat input.txt | python m.py | sort | python r.py
I got the following output (which I verified to be correct):
2008-03-02 37.82 37.82
2008-03-03 38.34 38.34
2008-03-04 38.36 38.36
2008-03-05 37.89 37.89
You should be able to run the same thing with the Hadoop framework:
hadoop jar hadoop-streaming-2.7.1.jar -input {Input directory in HDFS} -output {Output directory in HDFS} -mapper {Path to the m.py} -reducer {Path to the r.py}
Note: this code can be optimized, and you may not need the reducer at all. If your data is small, you can read all the values on the mapper side, sort them, and then compute the SMA there. I wrote this code just to illustrate computing an SMA with Hadoop streaming.
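As a rough sketch of that reducer-free variant, the function below reads every row, sorts by date, and computes the SMA in one pass. The name sma_without_reducer and the list raw are made up for this illustration, and the sketch assumes the whole data set fits in memory and reaches a single process.

```python
def sma_without_reducer(raw_lines, interval=3):
    """Sort 'date<TAB>AA<TAB>BB' rows by date and return one SMA row per full window."""
    rows = []
    for line in raw_lines:
        date, a, b = line.strip().split('\t')
        rows.append((date, float(a), float(b)))
    # ISO dates (YYYY-MM-DD) sort chronologically when compared as strings.
    rows.sort()
    out = []
    for i in range(interval - 1, len(rows)):
        window = rows[i - interval + 1:i + 1]
        sma_a = sum(r[1] for r in window) / interval
        sma_b = sum(r[2] for r in window) / interval
        out.append("%s\t%.2f\t%.2f" % (rows[i][0], sma_a, sma_b))
    return out


# The question's data in its original (newest-first) order.
raw = [
    "2008-03-05\t36.60\t36.60",
    "2008-03-04\t38.37\t38.37",
    "2008-03-03\t38.71\t38.71",
    "2008-03-02\t38.00\t38.00",
    "2008-03-01\t38.32\t38.32",
    "2008-02-29\t37.14\t37.14",
]
for row in sma_without_reducer(raw):
    print(row)
```

Sorting inside the script replaces the external sort step, which is only reasonable while the input is small.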