How can I change a global variable's value inside map or reduce tasks using Python in Apache Spark?

Question

I have the following code:

import sys
from pyspark import SparkContext

def mapper(array):
    aux = []
    array = str(array)
    aux = array.split(' | ')
    return {(aux[0][:-1],aux[1][:-1]): [(aux[0][1:],aux[1][1:])]}

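def reducer(d1, d2):
    # Merge d2 into d1, concatenating the lists of any key present in both.
    for k in d1.keys():
        if k in d2:  # dict.has_key() no longer exists in Python 3
            d1[k] = d1[k] + d2[k]
            d2.pop(k)
    d1.update(d2)
    return d1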

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: bruijn <file>")
        exit(-1)
    sc = SparkContext(appName="Assembler")
    kd = sys.argv[1].lstrip('k').rstrip('mer.txt').split('d')
    k, d = int(kd[0]), int(kd[1])
    dic = sc.textFile(sys.argv[1],False).map(mapper).reduce(reducer)
    filepath = open('DeBruijn.txt', 'w')
    for key in sorted(dic):
        filepath.write(str(key) + ' -> ' + str(dic[key]) + '\n')
    filepath.close()        
    print('De Bruijn graph successfully generated!')
    sc.stop()

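I want to create an empty list called vertexes in main and have the mapper append elements to it. However, using the global keyword doesn't work. I also tried an accumulator, but an accumulator's value cannot be read inside tasks.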

Answer 1

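I figured out how to do this by creating a custom type of Accumulator that works with lists. In my code, all I had to do was add the following import and implement the following class: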

from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        return []
    def addInPlace(self, val1, val2):
        # A task adds one tuple at a time, but merging a task's result passes a whole list.
        # Without this check the result would be a list with all the tuples nested inside another list.
        return val1 + val2 if isinstance(val2, list) else val1 + [val2]
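To see the two cases that addInPlace has to handle, here is a plain-Python sketch that needs no Spark. The standalone function add_in_place is hypothetical and only mirrors the intent of the class above. It assumes, as I understand PySpark's behaviour, that the method is called once per += inside a task and once more on the driver for each task's partial list.

```python
def add_in_place(val1, val2):
    """Append a single element, or concatenate when val2 is already a list."""
    return val1 + val2 if isinstance(val2, list) else val1 + [val2]

# Inside a task: elements arrive one tuple at a time.
task_a = add_in_place(add_in_place([], ("AA", "CC")), ("AT", "CG"))
task_b = add_in_place([], ("TG", "GA"))

# On the driver: each task's partial list is merged into the running total.
total = add_in_place(add_in_place([], task_a), task_b)

# Without the list check, a partial list would be appended as one nested element.
nested = [] + [task_a]

print(total)
print(nested)
```

Both branches have to keep val1. A version that returns only val2 when it is a list would keep just the last task's elements once the job runs on more than one partition.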

My mapper function then looks like this:

def mapper(array):
    global vertexes
    aux = []
    array = str(array)
    aux = array.split(' | ')
    vertexes += (aux[0][:-1], aux[1][:-1])  # Add the prefix pair (a tuple) to the accumulator
    vertexes += (aux[0][1:], aux[1][1:])  # Add the suffix pair (a tuple) to the accumulator
    return {(aux[0][:-1],aux[1][:-1]): [(aux[0][1:],aux[1][1:])]}

Inside the main function, before calling the mapper function, I created the accumulator:

vertexes = sc.accumulator([],VectorAccumulatorParam())

After the mapper/reducer calls, I can get the result:

vertexes = list(set(vertexes.value))

Answer 2

Herio Sousa's VectorAccumulatorParam is a good idea. However, you can actually use the built-in class AddingAccumulatorParam, which is basically the same as VectorAccumulatorParam.

See the original code here: https://github.com/apache/spark/blob/41afa16500e682475eaa80e31c0434b7ab66abcb/python/pyspark/accumulators.py#L197-L213
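One practical difference is worth checking before swapping the classes. Assuming the built-in class merges values with an in-place += (as the linked source appears to do), adding a bare tuple to a list extends the list with the tuple's elements instead of appending the tuple. The function adding_add_in_place below is a hypothetical plain-Python stand-in for that merge step, not the library code itself.

```python
def adding_add_in_place(value1, value2):
    # Assumed to mirror the linked addInPlace: an in-place += on the stored value.
    value1 += value2
    return value1

# Wrapping the tuple in a list keeps it intact as a single element.
wrapped = adding_add_in_place([], [("AA", "CC")])

# A bare tuple is unpacked: its two strings become separate list elements.
bare = adding_add_in_place([], ("AA", "CC"))

print(wrapped)
print(bare)
```

Under that assumption, the mapper would need to write vertexes += [(a, b)] rather than vertexes += (a, b) when using the built-in class.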

Answer 3

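As you've noticed, you can't append elements inside the mapper (or rather, you can, but the change won't propagate to any other mapper or to your main function). As you've also noticed, accumulators do let you append elements, but they can only be written to in the executors and read in the driver. If you want the distinct keys, you could have another mapper output the keys and call distinct on it. You might also want to look at reduceByKey instead of the reduce you are using.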
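The suggestion above could be sketched roughly as follows. The helper names to_edge, to_vertexes and build_graph are hypothetical, the input is assumed to hold one 'ABC | DEF' pair per line as in the question, and sc is assumed to be an already created SparkContext.

```python
def to_edge(line):
    """Turn 'ABC | DEF' into ((prefix pair), [(suffix pair)])."""
    left, right = str(line).split(' | ')
    return (left[:-1], right[:-1]), [(left[1:], right[1:])]

def to_vertexes(line):
    """Emit both endpoints of the edge so they can be de-duplicated later."""
    src, dsts = to_edge(line)
    return [src, dsts[0]]

def build_graph(sc, path):
    """Build the adjacency dict and the vertex list without any shared state."""
    lines = sc.textFile(path).cache()
    # reduceByKey concatenates the successor lists per key on the executors.
    graph = lines.map(to_edge).reduceByKey(lambda a, b: a + b).collectAsMap()
    # A second pass emits every vertex; distinct() removes the duplicates.
    vertexes = lines.flatMap(to_vertexes).distinct().collect()
    return graph, vertexes
```

Because each function only returns values, nothing depends on a global list or an accumulator. The driver receives the graph and the vertex list as ordinary results.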

Answer 1: score 2, accepted, 2015-06-23 04:42:05
Answer 2: score 1, 2016-05-19 19:57:57
Answer 3: score 0, 2015-06-23 01:56:25
