My RDD is made of many items, each of which is a tuple as follows:
(key1, (val1_key1, val2_key1))
(key2, (val1_key2, val2_key2))
(key1, (val1_again_key1, val2_again_key1))
... and so on
I used groupByKey on the RDD, which gave this result:
(key1, [(val1_key1, val2_key1), (val1_again_key1, val2_again_key1), (), ... ()])
(key2, [(val1_key2, val2_key2), (), () ... ()])
... and so on
I need to do the same using reduceByKey. I tried doing
RDD.reduceByKey(lambda val1, val2: list(val1).append(val2))
but it doesn't work.
Please suggest the right way to implement this using reduceByKey().
The answer is that you cannot (or at least not in a straightforward and Pythonic way without abusing language dynamism). Since the type of the values and the return type are different (a single tuple vs. a list of tuples), reduce is not a valid function here. You could use combineByKey or aggregateByKey, for example like this:
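rdd = sc.parallelize([
    ("key1", ("val1_key1", "val2_key1")),
    ("key2", ("val1_key2", "val2_key2"))])
# Zero value: an empty list for each key.
# First function: add one value to a partition-local list.
# Second function: concatenate lists built on different partitions.
rdd.aggregateByKey([], lambda acc, x: acc + [x], lambda acc1, acc2: acc1 + acc2)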
but it is just a less efficient version of groupByKey. See also: Is groupByKey ever preferred over reduceByKey
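To make the type mismatch concrete, here is a minimal plain-Python sketch (no Spark required). It assumes functools.reduce is a fair stand-in for the per-key merge that reduceByKey performs, and the sample values are made up for illustration. It shows why the lambda from the question fails, and how wrapping each value in a one-element list first makes the input and output types agree:

```python
from functools import reduce

# Hypothetical values that share one key, as the merge function would see them.
values = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]

# The attempt from the question. list.append mutates in place and returns None,
# so the merged result is lost after the very first step.
broken = lambda val1, val2: list(val1).append(val2)

first = broken(values[0], values[1])  # None, not a list

try:
    reduce(broken, values)
    failed = False
except TypeError:
    # The second step calls list(None), which raises TypeError.
    failed = True

# Workaround: wrap every value in a one-element list so that both arguments
# and the result of the merge function are lists of tuples.
wrapped = [[v] for v in values]                # stands in for mapValues(lambda v: [v])
merged = reduce(lambda a, b: a + b, wrapped)   # stands in for reduceByKey(lambda a, b: a + b)

print(first, failed, merged)
```

In PySpark the same idea would read rdd.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b). It still builds the full list of values for every key, so it should not be expected to beat groupByKey.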