Combine two RDDs
I am new to Spark. Can someone help me find a way to combine two RDDs into a final RDD according to the following logic in Scala, preferably without using SQLContext (DataFrames)?

RDD1 = column1, column2, column3 (362,825 records)
RDD2 = column2_distinct (same as column2 in RDD1, but containing only distinct values), column4 (2,621 records)
Final RDD = column1, column2, column3, column4
Example:

RDD1 =
userid | progid | rating
a      | 001    | 5
b      | 001    | 3
b      | 002    | 4
c      | 003    | 2

RDD2 =
progid (distinct) | id
001               | 1
002               | 2
003               | 3

Final RDD =
userid | progid | id | rating
a      | 001    | 1  | 5
b      | 001    | 1  | 3
b      | 002    | 2  | 4
c      | 003    | 3  | 2
Code:
val rawRdd1 = pairrdd1.map(x => x._1.split(",")(0) + "," + x._1.split(",")(1) + "," + x._2) //362825 records
val rawRdd2 = pairrdd2.map(x => x._1 + "," + x._2) //2621 records
val schemaString1 = "userid programid rating"
val schemaString2 = "programid id"
val fields1 = schemaString1.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val fields2 = schemaString2.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema1 = StructType(fields1)
val schema2 = StructType(fields2)
val rowRDD1 = rawRdd1.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2)))
val rowRDD2 = rawRdd2.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1)))
val DF1 = sparkSession.createDataFrame(rowRDD1, schema1)
val DF2 = sparkSession.createDataFrame(rowRDD2, schema2)
DF1.createOrReplaceTempView("df1")
DF2.createOrReplaceTempView("df2")
val resultDf = DF1.join(DF2, Seq("programid"))
val DF3 = sparkSession.sql("""SELECT df1.userid, df1.programid, df2.id, df1.rating FROM df1 JOIN df2 on df1.programid == df2.programid""")
println(DF1.count()) //362825 records
println(DF2.count()) //2621 records
println(DF3.count()) //only 297 records
I expect the same number of records as DF1, with a new column (id) appended from DF2 holding the id value corresponding to each programid.
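To keep every RDD1 record (which an inner join does not guarantee), the small RDD2 can be treated as a lookup table applied in a single pass over RDD1. This is a minimal sketch of that logic on plain Scala collections standing in for the RDDs, using the sample data from the question; the names `rows` and `idById` are illustrative, not from the original post:

```scala
object JoinSketch extends App {
  // RDD1 rows: (userid, progid, rating)
  val rows = List(("a", "001", "5"), ("b", "001", "3"), ("b", "002", "4"), ("c", "003", "2"))
  // RDD2 as a lookup table: progid -> id
  val idById = Map("001" -> "1", "002" -> "2", "003" -> "3")

  // One pass over RDD1, attaching the id for each progid.
  // With real RDDs this would be rdd1.map(...) after collecting rdd2 as a map
  // on the driver (e.g. via a broadcast variable), so no RDD1 record is dropped.
  val result = rows.map { case (userid, progid, rating) =>
    (userid, progid, idById.getOrElse(progid, "?"), rating)
  }

  result.foreach(println)
}
```

Because every RDD1 row is mapped rather than joined, the output count always equals the RDD1 count, and missing progids surface visibly (here as "?") instead of silently disappearing.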
It's a bit ugly, but it should work (Spark 2.0):
val rdd1 = sparkSession.sparkContext.parallelize(List("a,001,5", "b,001,3", "b,002,4","c,003,2"))
val rdd2 = sparkSession.sparkContext.parallelize(List("001,1", "002,2", "003,3"))
val groupedRDD1 = rdd1.map(x => (x.split(",")(1),x))
val groupedRDD2 = rdd2.map(x => (x.split(",")(0),x))
val joinRDD = groupedRDD1.join(groupedRDD2)
// convert back to String
val cleanJoinRDD = joinRDD.map(x => x._1 + "," + x._2._1.replace(x._1 + ",","") + "," + x._2._2.replace(x._1 + ",",""))
cleanJoinRDD.collect().foreach(println)
I think the better option, though, is to use Spark SQL.
First of all, why do you split, join, and then split the rows again? You can do it in one step:
val rowRdd1 = pairrdd1.map { x =>
  val Array(userid, progid) = x._1.split(",") // pattern-match the Array, not a tuple
  val rating = x._2
  Row(userid, progid, rating)
}
My guess is that the problem is some extra characters in your keys, so they don't match in the join. A simple way to check is to do a left join and inspect the unmatched rows.
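The left-join check above can be sketched as follows. To keep the example self-contained it emulates `rdd1.leftOuterJoin(rdd2)` on plain Scala collections; the stray space in `" 002"` is a made-up value illustrating the kind of mismatch to look for:

```scala
object LeftJoinCheck extends App {
  // Keys as they might come out of each RDD; " 002" has a stray leading space.
  val keys1 = List("001", " 002", "003")
  val keys2 = Set("001", "002", "003")

  // Emulates the left join: keys from the left side with no partner on the right.
  val unmatched = keys1.filterNot(keys2.contains)
  unmatched.foreach(k => println(s"no match for key: '$k'")) // quotes make stray spaces visible
}
```

With real RDDs the same idea is `groupedRDD1.leftOuterJoin(groupedRDD2)` followed by a filter on `None` right-hand values; printing the surviving keys in quotes makes invisible characters show up.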
It could be extra whitespace in the rows, which you can fix for both RDDs like this:
val rowRdd1 = pairrdd1.map { x =>
  val Array(userid, progid) = x._1.split(",").map(_.trim)
  val rating = x._2
  Row(userid, progid, rating)
}
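The `trim` fix works because a key with an invisible trailing or leading space never compares equal to its clean counterpart, so the join silently drops the row. A quick standalone illustration (the trailing space is a made-up example of such debris):

```scala
object TrimDemo extends App {
  val fromRdd1 = "001 " // trailing space, e.g. left over from a CSV export
  val fromRdd2 = "001"
  println(fromRdd1 == fromRdd2)      // false: these keys never join
  println(fromRdd1.trim == fromRdd2) // true once trimmed
}
```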