
org.apache.spark.rdd.RDD[((String, Double), (String, Double))] to DataFrame in Scala

I am learning Scala/Spark. A few groupBy operations in Scala produced the RDD below. Now I am trying to write it out as a SQL DataFrame and save it in Hadoop, but I am stuck on how to convert it when writing it to the DataFrame.

Sample RDD format:

Array[((String, Double), (String, Double))] = Array(((Veterans Affairs Dept of,11669.0),(Veterans Affairs Dept of,101124.0)), ((Office Wisc Public Defender,40728.0),(Office Wisc Public Defender,40728.0)))

Using .toDF directly gives:

 +--------------------+--------------------+
 |                  _1|                  _2|
 +--------------------+--------------------+
 |[Veterans Affairs...|[Veterans Affairs...|
 |[Office Wisc Publ...|[Office Wisc Publ...|
 |[Health Services,...|[Health Services,...|
 +--------------------+--------------------+

What should I do to get the result above in the format shown below:

+--------------------+-------+--------+
|                  _1|     _2|      _3|
+--------------------+-------+--------+
|[Veterans Affairs...|11669.0|101124.0|
|[Office Wisc Publ...|40728.0| 40728.0|
+--------------------+-------+--------+
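For illustration, the flattening being asked for can be sketched with plain Scala collections (no Spark needed); the sample rows below are taken from the question:

```scala
// Plain-Scala sketch (no Spark) of the desired flattening: each pair of
// (name, value) pairs collapses into one (name, value1, value2) triple.
val data = Array(
  (("Veterans Affairs Dept of", 11669.0), ("Veterans Affairs Dept of", 101124.0)),
  (("Office Wisc Public Defender", 40728.0), ("Office Wisc Public Defender", 40728.0)))

// Keep the name from the first pair plus both doubles.
val flattened = data.map { case ((name, v1), (_, v2)) => (name, v1, v2) }

flattened.foreach(println)
```

On an RDD the same `map` works, and calling `.toDF` on the result yields the three-column DataFrame shown above.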

Since you used a groupBy operation, I will assume the two strings inside each ((String, Double), (String, Double)) pair are identical. If so, you can try the following:

// assume myRDD: RDD[((String, Double), (String, Double))] is your grouped RDD

val strings = myRDD.map(a => a._1._1)             // the shared name string
val values  = myRDD.map(a => (a._1._2, a._2._2))  // the two doubles
val rows    = strings.zip(values)                 // (name, (v1, v2)) pairs
val rowsDF  = rows.map { case (a, b) => (a, b._1, b._2) }.toDF

For example, consider the following dummy data:

val myRDD=sc.parallelize(Array((("string1",1.0),("string1",2.0)),(("string2",3.0),("string2",4.0))))

myRDD: org.apache.spark.rdd.RDD[((String, Double), (String, Double))] = ParallelCollectionRDD[33] at parallelize at <console>:27

The output will be:

scala> rowsDF: org.apache.spark.sql.DataFrame = [_1: string, _2: double, _3: double]
scala> rowsDF.collect()
res49: Array[org.apache.spark.sql.Row] = Array([string1,1.0,2.0], [string2,3.0,4.0])

If the row._1._1 string equals the row._2._1 string, then you have an RDD[((String, Double), (String, Double))] and want to convert it to an RDD[(String, Double, Double)]:

val input: Array[((String, Double), (String, Double))] =
    Array((("Veterans Affairs Dept of", 11669.0), ("Veterans Affairs Dept of", 101124.0)),
      (("Office Wisc Public Defender", 40728.0), ("Office Wisc Public Defender", 40728.0)))

The input RDD[((String, Double), (String, Double))]:

val myRDD: RDD[((String, Double), (String, Double))] = sc.parallelize(input)

Convert it to an RDD[(String, Double, Double)] using flatMap:

val resultRDD: RDD[(String, Double, Double)] =
    myRDD.flatMap(row => row._1._1 match {
      case firstString if firstString == row._2._1 =>
        Some((firstString, row._1._2, row._2._2))
      case _ => None
    })
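As a quick sanity check, the same pattern works on a plain Scala Array (no Spark); the second row below uses deliberately mismatched, made-up names and is dropped by the None branch:

```scala
// Local mirror of the Spark flatMap above: rows whose two names differ
// are filtered out; matching rows become (name, v1, v2) triples.
// "name a" / "name b" are made-up values to show the mismatch case.
val sample = Array(
  (("same name", 1.0), ("same name", 2.0)),
  (("name a", 3.0), ("name b", 4.0)))

val kept = sample.flatMap {
  case ((n1, v1), (n2, v2)) if n1 == n2 => Some((n1, v1, v2))
  case _                                => None
}

kept.foreach(println)  // only the matching row survives
```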

Convert the RDD into a DataFrame:

resultRDD.toDF().show()

Result:

+--------------------+-------+--------+
|                  _1|     _2|      _3|
+--------------------+-------+--------+
|Veterans Affairs ...|11669.0|101124.0|
|Office Wisc Publi...|40728.0| 40728.0|
+--------------------+-------+--------+
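Since the question also mentions saving the result in Hadoop, a minimal sketch of that last step could look like this (it assumes a running Spark application; the column names and the hdfs:// path are illustrative placeholders, not from the original post):

```scala
// Sketch only: give the columns readable names, then write to HDFS as Parquet.
// "agency", "value1", "value2" and the path below are placeholders.
val named = resultRDD.toDF("agency", "value1", "value2")
named.write.mode("overwrite").parquet("hdfs:///user/yourname/result")
```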
