
How to change a global variable value inside a task map or reduce in Apache Spark using Python

I have the following code:

import sys
from pyspark import SparkContext

def mapper(array):
    aux = []
    array = str(array)
    aux = array.split(' | ')
    return {(aux[0][:-1],aux[1][:-1]): [(aux[0][1:],aux[1][1:])]}

def reducer(d1, d2):
    for k in d1.keys():
        if k in d2:  # dict.has_key() does not exist in Python 3
            d1[k] = d1[k] + d2[k]
            d2.pop(k)
    d1.update(d2)
    return d1 

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: bruijn <file>")
        exit(-1)
    sc = SparkContext(appName="Assembler")
    kd = sys.argv[1].lstrip('k').rstrip('mer.txt').split('d')
    k, d = int(kd[0]), int(kd[1])
    dic = sc.textFile(sys.argv[1],False).map(mapper).reduce(reducer)
    filepath = open('DeBruijn.txt', 'w')
    for key in sorted(dic):
        filepath.write(str(key) + ' -> ' + str(dic[key]) + '\n')
    filepath.close()        
    print('De Bruijn graph successfully generated!')
    sc.stop()

I would like to create an empty list called vertexes inside the main function and have the mapper append elements to it. However, using the global keyword does not work. I have tried using an accumulator, but an accumulator's value cannot be accessed inside tasks.
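
For reference, a minimal sketch of the behaviour I mean (placeholder input path, not my real code): the mapper closure is pickled and shipped to the executors, so each task appends to its own copy of the list and the list in the driver stays empty.

vertexes = []  # lives in the driver process

def mapper(line):
    global vertexes
    vertexes.append(line)  # mutates an executor-local copy only
    return line

sc.textFile('reads.txt').map(mapper).count()  # 'reads.txt' is a placeholder
print(len(vertexes))  # still 0 in the driver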

I figured out how to do it by creating a custom type of Accumulator that works with lists. In my code, all I had to do was add the following import and implement the following class:

from pyspark.accumulators import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        return []
    def addInPlace(self, val1, val2):
        # val2 is a single tuple when a task adds an element, and a list when two partial
        # results are merged; without this check the tuples would end up nested inside another list.
        if isinstance(val2, list):
            return val1 + val2
        return val1 + [val2]

My mapper function would be like this:

def mapper(array):
    global vertexes
    aux = []
    array = str(array)
    aux = array.split(' | ')
    vertexes += (aux[0][:-1], aux[1][:-1])  # add the prefix pair to the accumulator
    vertexes += (aux[0][1:], aux[1][1:])    # add the suffix pair to the accumulator
    return {(aux[0][:-1],aux[1][:-1]): [(aux[0][1:],aux[1][1:])]}

And inside the main function before calling the mapper function I created the accumulator:

vertexes = sc.accumulator([],VectorAccumulatorParam())

After the mapper/reducer function calls, I could get the result:

vertexes = list(set(vertexes.value))
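
Putting it together, the driver side of my script looks roughly like this (a sketch, not a verbatim copy): the accumulator has to be created before the map/reduce action runs, and its value is only read back in the driver afterwards.

sc = SparkContext(appName="Assembler")
vertexes = sc.accumulator([], VectorAccumulatorParam())  # must exist before the action

dic = sc.textFile(sys.argv[1], False).map(mapper).reduce(reducer)  # reduce is an action, so the tasks run here

vertexes = list(set(vertexes.value))  # read the accumulated tuples back in the driver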

Herio Sousa's VectorAccumulatorParam is a good idea. However, you can actually use the built-in class AddingAccumulatorParam, which is basically the same as VectorAccumulatorParam.

Check out the original code here https://github.com/apache/spark/blob/41afa16500e682475eaa80e31c0434b7ab66abcb/python/pyspark/accumulators.py#L197-L213
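
For example, a minimal sketch of how it could be wired in (same mapper logic as the question; note that AddingAccumulatorParam lives in pyspark.accumulators and is not part of the documented public API). Because its addInPlace uses +=, each tuple should be wrapped in a one-element list so it is appended as a single element instead of being spliced into the accumulator:

from pyspark.accumulators import AddingAccumulatorParam

vertexes = sc.accumulator([], AddingAccumulatorParam([]))  # zero value is an empty list

def mapper(array):
    aux = str(array).split(' | ')
    # list += [tuple] appends the tuple as one element
    vertexes.add([(aux[0][:-1], aux[1][:-1])])
    vertexes.add([(aux[0][1:], aux[1][1:])])
    return {(aux[0][:-1], aux[1][:-1]): [(aux[0][1:], aux[1][1:])]}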

As you've noticed, you can't append elements inside the mapper (or rather, you can append them, but the change is not propagated to any of the other mappers or to your main function). As you've also noticed, accumulators do allow you to append elements, but they can only be read in the driver program and written to in the executors. You could have another mapper output the keys and call distinct on it if you want the distinct keys. You might also want to look at reduceByKey instead of the reduce you are using.
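
A rough sketch of that approach against the same 'prefix | suffix' input lines (function names here are illustrative, not from the original code): derive the vertexes with a second transformation plus distinct(), and build the adjacency lists with reduceByKey instead of merging whole dictionaries in a single reduce.

def endpoints(line):
    aux = str(line).split(' | ')
    # both the prefix pair and the suffix pair are vertexes of the graph
    return [(aux[0][:-1], aux[1][:-1]), (aux[0][1:], aux[1][1:])]

def edge(line):
    aux = str(line).split(' | ')
    return ((aux[0][:-1], aux[1][:-1]), [(aux[0][1:], aux[1][1:])])

lines = sc.textFile(sys.argv[1])

vertexes = lines.flatMap(endpoints).distinct().collect()                 # distinct keys, no accumulator needed
graph = lines.map(edge).reduceByKey(lambda a, b: a + b).collectAsMap()   # same adjacency dict as before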
