Join and multiply RDDs in PySpark

I have two RDDs which I want to multiply by key. That can be done either by merging the two RDDs and multiplying the elements or by multiplying the RDDs without merging them.

Say I have these two RDDs:

rdd1 = [("dog", 2), ("ox", 4), ("cat", 1)]
rdd2 = [("dog", 9), ("ox", 2), ("cat", 2)]

What I want is:

multiplied_rdd = [("dog", 18), ("ox", 8), ("cat", 2)]

I tried joining the two RDDs, planning to multiply the values afterwards, but I am getting an error:

merged_rdd = rdd1.join(rdd2)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

If I managed to get the merged RDD, I would then do:

multiplied = merged_rdd.map(lambda x: (x[0], x[1][0] * x[1][1]))

So, my question is: how can I get the "multiplied_rdd" RDD, either by joining rdd1 and rdd2 or by multiplying them without a join?

Not sure how you are initializing the RDDs (what you show are plain Python lists, not RDDs), but this should work:

from pyspark.sql import SparkSession

rdd1 = [("dog", 2), ("ox", 4), ("cat", 1)]
rdd2 = [("dog", 9), ("ox", 2), ("cat", 2)]

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
x = sc.parallelize(rdd1)
y = sc.parallelize(rdd2)
merged_rdd = x.join(y)  # pairs values by key, e.g. [('ox', (4, 2)), ('dog', (2, 9)), ('cat', (1, 2))]
multiplied = merged_rdd.map(lambda x: (x[0], x[1][0] * x[1][1]))

print(multiplied.collect())
# [('ox', 8), ('dog', 18), ('cat', 2)]
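
If you would rather avoid the join entirely, here is a minimal sketch of the "multiply without merging" route, reusing the x and y RDDs from above. It assumes each key appears exactly once in each RDD; with duplicate keys, reduceByKey would fold the extra values into the product as well.

combined = x.union(y)  # plain concatenation: all six (key, value) pairs
multiplied = combined.reduceByKey(lambda a, b: a * b)  # multiply values sharing a key

print(multiplied.collect())
# e.g. [('dog', 18), ('ox', 8), ('cat', 2)] (ordering depends on partitioning)

This also skips building the intermediate (key, (v1, v2)) tuples that join produces.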
