How do I join an RDD[Rating] and a scala.collection.Map[Int, Double] on a key column?

Question

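I have two tables:

Table 1: RDD[Rating] (rdd1, rdd2, rdd3)

and

Table 2: scala.collection.Map[Int, Double] (m1, m2)

I have spent a lot of time and effort trying to get a joined table like

(key (key = rdd2 = m1), rdd3, m2)

but I always get a type mismatch.

Can you suggest how to handle this? I also tried converting both tables to a single type, but I realized that was surely the wrong approach...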

Answer 1

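Since you have an RDD and a Map, you can iterate over the RDD directly and look each key up in the Map.

Assuming Rating has 3 fields (rdd1, rdd2, rdd3), let's rename them to field1, field2 and field3 to keep the example clear and avoid confusion.

Given this sample input: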

import spark.implicits._ // needed for .toDF outside spark-shell (spark-shell imports it automatically)

case class Rating(field1: String, field2: Int, field3: String) // custom case class
val yourRDD = spark.sparkContext.parallelize(
  Seq(
    Rating("rating1", 1, "str1"), // item 1
    Rating("rating2", 2, "str2"), // item 2
    Rating("rating3", 3, "str3")  // item 3
  )
)
yourRDD.toDF.show() // convert to a DataFrame only to visualize the data

This prints your data source, which looks like:

+-------+------+------+
| field1|field2|field3|
+-------+------+------+
|rating1|     1|  str1|
|rating2|     2|  str2|
|rating3|     3|  str3|
+-------+------+------+

Likewise, your map holds the following sample data:

val yourMap = Map(
  1 -> 1.111,
  2 -> 2.222,
  3 -> 3.333
)
println(yourMap)

The data in the map:

yourMap: scala.collection.immutable.Map[Int,Double] = Map(1 -> 1.111, 2 -> 2.222, 3 -> 3.333)

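Then, to "merge" them, you only need to iterate over your RDD, take the value you want to use as the key (field2 in this case), and use it as the key into the map. Something like this: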

yourRDD
  .map { rating => // iterate over each item in your RDD
    val key = rating.field2 // get the key from the current item
    // Look up the value in the map using field2 as the key.
    // yourMap(key) throws NoSuchElementException when the key is absent,
    // so handle missing keys if the map does not cover every rating.
    val valueFromMap = yourMap(key)

    (key, rating.field3, valueFromMap) // one output tuple per rating; these form the new RDD
  }
  .toDF.show(truncate = false) // visualize the output

The code above outputs:

+---+----+-----+
|_1 |_2  |_3   |
+---+----+-----+
|1  |str1|1.111|
|2  |str2|2.222|
|3  |str3|3.333|
+---+----+-----+
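
One caveat with `yourMap(key)`: on a Scala Map, `apply` throws `NoSuchElementException` when the key is absent, so a single rating whose field2 has no entry in the map would fail the whole job. Below is a minimal sketch of two safer variants built on `Map.get`, which returns an `Option`. To keep it runnable without Spark, a plain `Seq` stands in for the RDD; the names `SafeLookup`, `joinDropMissing` and `joinKeepMissing` are hypothetical and exist only for this illustration. The same `flatMap` and `map` bodies should carry over to an RDD, but that part is an assumption and is not exercised here.

```scala
// Same shape as the Rating used in the example above.
final case class Rating(field1: String, field2: Int, field3: String)

// Hypothetical helper object; a Seq stands in for the RDD so no Spark is needed.
object SafeLookup {
  // Inner-join semantics: ratings whose key is absent from the map are dropped.
  def joinDropMissing(
      ratings: Seq[Rating],
      lookup: collection.Map[Int, Double]
  ): Seq[(Int, String, Double)] =
    ratings.flatMap { r =>
      // get returns Option[Double]; flatMap discards the None cases
      lookup.get(r.field2).map(v => (r.field2, r.field3, v))
    }

  // Left-join semantics: every rating is kept; a missing value shows up as None.
  def joinKeepMissing(
      ratings: Seq[Rating],
      lookup: collection.Map[Int, Double]
  ): Seq[(Int, String, Option[Double])] =
    ratings.map(r => (r.field2, r.field3, lookup.get(r.field2)))
}
```

Which variant fits depends on whether a rating without a matching map entry should disappear from the result or stay with an empty value. If a default value makes sense instead, `lookup.getOrElse(key, default)` is a third option.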

Hope this helps.

Solution 1
0 2019-11-01 21:37:51