
Join two RDDs then group by another column

I have two RDDs. The first has the format Code: string, Name: string, and the second (rdd2) has the format Code: string, Year: string, Delay: float.

rdd1 = [('a', 'name1'), ('b', 'name2')]
rdd2 = [('a', '2000', 1.25), ('a', '2000', 2.0), ('b', '2010', -1.0)]

I want to join them on code so that I can group the data by name and compute aggregates such as count, average, minimum and maximum of delay.

I tried to flatten the values after performing a join like so:

joined = rdd1.join(rdd2).map(lambda (keys, values): (keys,) + values)

but it raised an error: missing 1 required positional argument.

The join result also shows only [('code', ('name', 'year'))] and omits the delay values. How can I solve this?

The lambda in the question does not work in Python 3.x because support for tuple parameter unpacking was removed (PEP 3113). Hence the TypeError.
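As a minimal sketch of the Python 3 replacements, the snippet below uses a plain list as a stand-in for the joined RDD (the names records and flatten are made up for illustration). Either form can be passed to an RDD's map in the same way.

```python
# Plain-Python stand-in for elements of a joined pair RDD: (key, (v1, v2)).
records = [('a', ('name1', '2000')), ('b', ('name2', '2010'))]

# Python 2 only: lambda (keys, values): (keys,) + values

# Python 3, option 1: accept a single argument and index into it.
flat_lambda = list(map(lambda kv: (kv[0],) + kv[1], records))

# Python 3, option 2: unpack inside a named function.
def flatten(kv):
    key, values = kv
    return (key,) + values

flat_def = list(map(flatten, records))

print(flat_lambda)
print(flat_def)
```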

Joins on pair RDDs operate on (key, value) elements: joining (a, b) with (a, c) gives (a, (b, c)).
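To make the shape of the result concrete, here is a plain-Python emulation of an inner join on (key, value) pairs. It is only an illustration of the output structure, not how Spark implements join; pair_join, names and delays are hypothetical names, and the output order of a real RDD join is not guaranteed to match.

```python
from collections import defaultdict

def pair_join(left, right):
    """Inner join of two lists of (key, value) pairs, returning (key, (lv, rv))."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

names = [('a', 'name1'), ('b', 'name2')]
delays = [('a', '2000', 1.25), ('a', '2000', 2.0), ('b', '2010', -1.0)]

# Reshape each 3-tuple into (code, (year, delay)) so the whole payload is the value.
delay_pairs = [(code, (year, delay)) for code, year, delay in delays]

joined = pair_join(names, delay_pairs)
for row in joined:
    print(row)
```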

So, one way to make it work is this:

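# Reshape rdd2 into (code, (year, delay)) pairs so the join keeps every field.
joined = rdd1.join(rdd2.map(lambda x: (x[0], x[1:])))
# Flatten (code, (name, (year, delay))) into (code, name, year, delay).
joined.map(lambda x: (x[0],) + (x[1][0],) + x[1][1]).collect()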
# Output
# [('b', 'name2', '2010', -1.0),
# ('a', 'name1', '2000', 1.25),
# ('a', 'name1', '2000', 2.0)]

Make sure each element of rdd2 has the form (key, value) before joining. Otherwise everything after the second element of each tuple is discarded, which is why the delay values were missing from the join result.

rdd3 = rdd1.join(rdd2.map(lambda x: (x[0], (x[1], x[2]))))

rdd3.collect()
# [('b', ('name2', ('2010', -1.0))), ('a', ('name1', ('2000', 1.25))), ('a', ('name1', ('2000', 2.0)))]

If you want to remove the nested structure, add one more mapValues:

rdd3 = rdd1.join(rdd2.map(lambda x: (x[0], (x[1], x[2])))).mapValues(lambda x: (x[0], x[1][0], x[1][1]))

rdd3.collect()
# [('b', ('name2', '2010', -1.0)), ('a', ('name1', '2000', 1.25)), ('a', ('name1', '2000', 2.0))]
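For the aggregation by name that the question asks about, one possible approach is aggregateByKey with a small accumulator. The sketch below is not from the answers above: the accumulator layout (count, total, minimum, maximum) and the helper names are assumptions, and the functions are exercised locally with functools.reduce so they can be checked without a Spark context. The Spark calls are shown only as comments.

```python
from functools import reduce

# Accumulator layout (an assumption of this sketch): (count, total, minimum, maximum).
ZERO = (0, 0.0, float('inf'), float('-inf'))

def add_delay(acc, delay):
    """Fold one delay value into an accumulator."""
    count, total, lo, hi = acc
    return (count + 1, total + delay, min(lo, delay), max(hi, delay))

def merge_stats(a, b):
    """Combine two accumulators, e.g. ones built on different partitions."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))

def finalize(acc):
    """Turn an accumulator into the requested aggregates."""
    count, total, lo, hi = acc
    return {'count': count, 'avg': total / count, 'min': lo, 'max': hi}

# Local check using the delays recorded for 'name1'.
stats = finalize(reduce(add_delay, [1.25, 2.0], ZERO))
print(stats)

# With Spark, starting from the flattened rdd3, whose values are (name, year, delay):
# by_name = rdd3.map(lambda kv: (kv[1][0], kv[1][2]))
# result = by_name.aggregateByKey(ZERO, add_delay, merge_stats).mapValues(finalize)
```

Keeping a running total and count, rather than a running average, lets partial results from different partitions be merged correctly.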
