简体   繁体   中英

How to optimize this code on spark?

How to make this code more efficient in Spark?
I need to calculate minimum, maximum, count, mean from data.
Here is my sample data,

Name Shop Money
A Shop001 99.99
A Shop001 87.15
B Shop001 3.99
...

Now I try to organize my data to generate mean, min, max, count by Name+Shop (key).
Then get the result by collect().
Here is my code in spark,

def tupleDivide(y): return float(y[0])/y[1] def smin(a, b): return min(a, b) def smax(a, b): return max(a, b) raw = sgRDD.map(lambda x: getVar(parserLine(x),list_C+list_N)).cache() cnt = raw.map(lambda (x,y,z): (x+"_"+y, 1)).countByKey() sum = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(add) min = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smin) max = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smax) raw_cntRDD = sc.parallelize(cnt.items(),3) raw_mean = sum.join(raw_cntRDD).map(lambda (x, y): (x, tupleDivide(y)))

Would anyone provide some suggestion about the elegant coding style?
Thanks!

You should use aggregateByKey for more optimal processing. The idea is that you store state vector which consists of count, min, max, and sum, and use aggregation functions to get the final values. Also, you can use tuple as a key, it is not necessary to concatenate keys into a single string.

data = [
        ['x', 'shop1', 1],
        ['x', 'shop1', 2],
        ['x', 'shop2', 3],
        ['x', 'shop2', 4],
        ['x', 'shop3', 5],
        ['y', 'shop4', 6],
        ['y', 'shop4', 7],
        ['y', 'shop4', 8]
    ]

def add(state, x):
    state[0] += 1
    state[1] = min(state[1], x)
    state[2] = max(state[2], x)
    state[3] += x
    return state

def merge(state1, state2):
    state1[0] += state2[0]
    state1[1] = min(state1[1], state2[1])
    state1[2] = max(state1[2], state2[2])
    state1[3] += state2[3]
    return state1

res = sc.parallelize(data).map(lambda x: ((x[0], x[1]), x[2])).aggregateByKey([0, 10000, 0, 0], add, merge)

for x in res.collect():
    print 'Client "%s" shop "%s" : count %d min %f max %f avg %f' % (
        x[0][0], x[0][1],
        x[1][0], x[1][1], x[1][2], float(x[1][3])/float(x[1][0])
    )

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM