How to make this code more efficient in Spark?
I need to calculate the minimum, maximum, count, and mean of my data.
Here is my sample data:
Name Shop Money
A Shop001 99.99
A Shop001 87.15
B Shop001 3.99
...
I want to group the data by Name + Shop (the key) and compute the mean, min, max, and count for each group.
I then fetch the result with collect().
Here is my Spark code:
def tupleDivide(y):
    return float(y[0])/y[1]

def smin(a, b):
    return min(a, b)

def smax(a, b):
    return max(a, b)

raw = sgRDD.map(lambda x: getVar(parserLine(x),list_C+list_N)).cache()
cnt = raw.map(lambda (x,y,z): (x+"_"+y, 1)).countByKey()
sum = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(add)
min = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smin)
max = raw.map(lambda (x,y,z): (x+"_"+y, z)).reduceByKey(smax)
raw_cntRDD = sc.parallelize(cnt.items(),3)
raw_mean = sum.join(raw_cntRDD).map(lambda (x, y): (x, tupleDivide(y)))
Could anyone suggest a more elegant way to write this?
Thanks!
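You should use aggregateByKey for more efficient processing. The idea is to keep a state vector per key that holds the count, min, max, and sum, and to update it with two aggregation functions: one folds a single value into the state, the other merges two states. All four statistics are then computed in a single pass, and the mean is simply sum / count. Also, you can use a tuple as the key; there is no need to concatenate the key fields into a single string.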
data = [
    ['x', 'shop1', 1],
    ['x', 'shop1', 2],
    ['x', 'shop2', 3],
    ['x', 'shop2', 4],
    ['x', 'shop3', 5],
    ['y', 'shop4', 6],
    ['y', 'shop4', 7],
    ['y', 'shop4', 8]
]

# state is [count, min, max, sum]

def add(state, x):
    # Fold one value into the state of its key.
    state[0] += 1
    state[1] = min(state[1], x)
    state[2] = max(state[2], x)
    state[3] += x
    return state

def merge(state1, state2):
    # Combine two partial states of the same key.
    state1[0] += state2[0]
    state1[1] = min(state1[1], state2[1])
    state1[2] = max(state1[2], state2[2])
    state1[3] += state2[3]
    return state1

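# sc is an existing SparkContext.
# Initial state: count 0, min +inf, max -inf, sum 0, so min and max are correct for any values.
zero = [0, float('inf'), float('-inf'), 0]
res = sc.parallelize(data).map(lambda x: ((x[0], x[1]), x[2])).aggregateByKey(zero, add, merge)
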
for x in res.collect():
    # x is ((name, shop), [count, min, max, sum]); the mean is sum / count.
    print('Client "%s" shop "%s" : count %d min %f max %f avg %f' % (
        x[0][0], x[0][1],
        x[1][0], x[1][1], x[1][2], float(x[1][3])/float(x[1][0])
    ))
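
To see why two functions are needed, here is a small sketch that mimics the aggregation on plain Python lists, without Spark. The helper local_aggregate_by_key and the two hand-made "partitions" are hypothetical and only for illustration; they are not part of the Spark API, and real Spark decides the partitioning itself. The infinities in the zero value are an assumption chosen so that min and max work for any input range.

```python
# State layout, as in the answer: [count, min, max, sum].
def add(state, x):
    state[0] += 1
    state[1] = min(state[1], x)
    state[2] = max(state[2], x)
    state[3] += x
    return state

def merge(state1, state2):
    state1[0] += state2[0]
    state1[1] = min(state1[1], state2[1])
    state1[2] = max(state1[2], state2[2])
    state1[3] += state2[3]
    return state1

def local_aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Hypothetical stand-in for aggregateByKey, run on plain lists."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # Each key starts from its own copy of the zero value,
            # then seq_op folds values in within one partition.
            acc[key] = seq_op(acc.get(key, list(zero)), value)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, state in acc.items():
            # comb_op joins the partial states of a key across partitions.
            merged[key] = comb_op(merged[key], state) if key in merged else state
    return merged

# Two pretend partitions; the keys are (name, shop) tuples.
partitions = [
    [(('x', 'shop1'), 1), (('x', 'shop2'), 3)],
    [(('x', 'shop1'), 2), (('x', 'shop2'), 4)],
]
zero = [0, float('inf'), float('-inf'), 0]
result = local_aggregate_by_key(partitions, zero, add, merge)

# Turn each state into (count, min, max, mean).
stats = {k: (s[0], s[1], s[2], float(s[3]) / s[0]) for k, s in result.items()}
print(stats)
```

Here add only ever sees values from one partition, while merge joins the per-partition states for the same key, which is roughly how the work is split in the Spark version.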