org.apache.spark.rdd.RDD[((String, Double), (String, Double))] to DataFrame in Scala
I am learning Scala/Spark. In Scala, a groupBy operation produced the RDD below. Now I am trying to write it out as a SQL DataFrame and save it in Hadoop, but I am having trouble converting it to a DataFrame.
Sample RDD format:
Array[((String, Double), (String, Double))] = Array(((Veterans Affairs Dept of,11669.0),(Veterans Affairs Dept of,101124.0)), ((Office Wisc Public Defender,40728.0),(Office Wisc Public Defender,40728.0)))
Calling .toDF directly gives:
+--------------------+--------------------+
|                  _1|                  _2|
+--------------------+--------------------+
|[Veterans Affairs...|[Veterans Affairs...|
|[Office Wisc Publ...|[Office Wisc Publ...|
|[Health Services,...|[Health Services,...|
+--------------------+--------------------+
What can I do to get the result above in the format shown below?
+--------------------+-------+--------+
|                  _1|     _2|      _3|
+--------------------+-------+--------+
|[Veterans Affairs...|11669.0|101124.0|
|[Office Wisc Publ...|40728.0| 40728.0|
+--------------------+-------+--------+
Since you used a groupBy operation, I will assume the two strings inside each element of your Array[((String, Double), (String, Double))] are identical. If so, you can try the following:
// myRDD: RDD[((String, Double), (String, Double))] — your grouped RDD
val strings = myRDD.map(a => a._1._1)            // the shared key string
val values  = myRDD.map(a => (a._1._2, a._2._2)) // the two doubles
val rows    = strings.zip(values)                // (key, (d1, d2))
val rowsDF  = rows.map { case (a, b) => (a, b._1, b._2) }.toDF
For example, consider the following dummy data:
val myRDD=sc.parallelize(Array((("string1",1.0),("string1",2.0)),(("string2",3.0),("string2",4.0))))
myRDD: org.apache.spark.rdd.RDD[((String, Double), (String, Double))] = ParallelCollectionRDD[33] at parallelize at <console>:27
The output will be:
scala> rowsDF: org.apache.spark.sql.DataFrame = [_1: string, _2: double, _3: double]
scala> rowsDF.collect()
res49: Array[org.apache.spark.sql.Row] = Array([string1,1.0,2.0], [string2,3.0,4.0])
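The three intermediate RDDs above can also be collapsed into a single pattern-matching map, avoiding the zip step entirely. A minimal sketch of the same flattening on a plain Scala Array (the identical .map works unchanged on the RDD, followed by .toDF):

```scala
// Illustrative only: the same nested-pair data as the dummy RDD above,
// held in a plain Array so the flattening logic is easy to see.
val pairs = Array(
  (("string1", 1.0), ("string1", 2.0)),
  (("string2", 3.0), ("string2", 4.0))
)

// Pattern-match each nested pair into a flat (key, d1, d2) triple,
// assuming (as the answer does) that both strings are the same.
val flat = pairs.map { case ((s, d1), (_, d2)) => (s, d1, d2) }
// flat: Array[(String, Double, Double)]
```

On the RDD this is simply `myRDD.map { case ((s, d1), (_, d2)) => (s, d1, d2) }.toDF`.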
If the row._1._1 string equals the row._2._1 string, then you have an RDD of ((String, Double), (String, Double)) and want to convert it to an RDD of (String, Double, Double).
val input: Array[((String, Double), (String, Double))] =
Array((("Veterans Affairs Dept of", 11669.0), ("Veterans Affairs Dept of", 101124.0)),
(("Office Wisc Public Defender", 40728.0), ("Office Wisc Public Defender", 40728.0)))
Input RDD[((String, Double), (String, Double))]:
val myRDD: RDD[((String, Double), (String, Double))] = sc.parallelize(input)
Use flatMap to convert it to RDD[(String, Double, Double)]:
val resultRDD: RDD[(String, Double, Double)] =
  myRDD.flatMap(row => row._1._1 match {
    case firstString if firstString == row._2._1 =>
      Some((firstString, row._1._2, row._2._2))
    case _ => None
  })
Convert the RDD to a DataFrame:
resultRDD.toDF().show()
Result:
+--------------------+-------+--------+
| _1| _2| _3|
+--------------------+-------+--------+
|Veterans Affairs ...|11669.0|101124.0|
|Office Wisc Publi...|40728.0| 40728.0|
+--------------------+-------+--------+
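Since the original goal was to save the result in Hadoop, the resulting DataFrame can then be written out with Spark's standard DataFrame writer. A minimal sketch, assuming a spark-shell session (so the toDF implicits are in scope); the column names and the HDFS path are illustrative placeholders, not from the original post:

```scala
// Give the columns readable names instead of _1/_2/_3.
val resultDF = resultRDD.toDF("department", "amount1", "amount2")

// Write to HDFS as Parquet; any supported format works (csv, orc, json).
// The path below is a placeholder — substitute your own.
resultDF.write
  .mode("overwrite")
  .parquet("hdfs:///user/example/output/result")
```

`mode("overwrite")` replaces any existing data at the path; use `"append"` to add to it instead.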