
Join and multiply RDDs in PySpark

I have two RDDs which I want to multiply by key. That can be done either by merging the two RDDs and multiplying the elements, or by multiplying the RDDs without merging them.

Say I have these two RDDs:

rdd1 = [("dog", 2), ("ox", 4), ("cat", 1)]
rdd2 = [("dog", 9), ("ox", 2), ("cat", 2)]

What I want is:

multiplied_rdd = [("dog", 18), ("ox", 8), ("cat", 2)]

I tried merging the two RDDs, intending to multiply the numbers afterwards, but I am getting an error:

merged_rdd = rdd1.join(rdd2)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.

If I manage to get the merged RDD, I would do:

multiplied = merged_rdd.map(lambda x: (x[0], x[1][0] * x[1][1]))

So, my question is: how can I obtain the "multiplied_rdd" RDD, either by joining rdd1 and rdd2 or by multiplying them without a join?

Not sure how you are initializing the RDDs, but this should work:

from pyspark.sql import SparkSession

rdd1 = [("dog", 2), ("ox", 4), ("cat", 1)]
rdd2 = [("dog", 9), ("ox", 2), ("cat", 2)]

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Turn the plain Python lists into pair RDDs
x = sc.parallelize(rdd1)
y = sc.parallelize(rdd2)

# join() pairs up the values that share a key:
# ("dog", 2) and ("dog", 9) become ("dog", (2, 9))
merged_rdd = x.join(y)

# Multiply the two values in each joined tuple
multiplied = merged_rdd.map(lambda x: (x[0], x[1][0] * x[1][1]))

print(multiplied.collect())
# [('ox', 8), ('dog', 18), ('cat', 2)]
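
If you want to avoid the join entirely (the second option mentioned in the question), you can get the same result by concatenating the two RDDs with union and then reducing by key. A minimal sketch, assuming each key appears at most once in each input RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

x = sc.parallelize([("dog", 2), ("ox", 4), ("cat", 1)])
y = sc.parallelize([("dog", 9), ("ox", 2), ("cat", 2)])

# union() simply concatenates the two RDDs into one
combined = x.union(y)

# reduceByKey() folds all values sharing a key with the given function,
# here multiplication: ("dog", 2) and ("dog", 9) -> ("dog", 18)
multiplied = combined.reduceByKey(lambda a, b: a * b)

print(multiplied.collect())
# [('ox', 8), ('dog', 18), ('cat', 2)]

One semantic difference to keep in mind: join keeps only keys that appear in both RDDs, while the union/reduceByKey version passes a key that appears in only one RDD through unchanged, since there is nothing to multiply it by.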
