
Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to implement a scenario in Spark using the Scala API. I pass a user-defined function to a DataFrame; it processes each row one by one and returns a tuple (Row, Row). How can I convert the resulting RDD[(Row, Row)] into a DataFrame of single rows? See the code sample below.

**Calling the map function:**

    val df_temp = df_outPut.map { x => AddUDF.add(x, date1, date2) }

**UDF definition:**

    def add(x: Row, dates: String*): (Row, Row) = {
      // ...
      var result1, result2: Row = Row()
      // ...
      (result1, result2)
    }

Now df_temp is an RDD[(Row, Row)]. My requirement is to flatten each tuple into two separate records, giving a single RDD[Row] or DataFrame. I appreciate your help.

You can use flatMap to flatten your Row tuples. Say we start from this example RDD:

rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))

val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
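The flattening logic is the same on a plain Scala collection, which makes it easy to try locally without a Spark session. In this sketch, `Seq(1, 2)` stands in for a `Row` (an assumption for illustration only); each `(x, y)` tuple contributes two elements to the output:

```scala
// Plain-Scala sketch of the flatMap flattening above.
// Seq(1, 2) stands in for an org.apache.spark.sql.Row.
val pairs = List((Seq(1, 2), Seq(3, 4)), (Seq(2, 1), Seq(4, 2)))

// Each (x, y) tuple becomes two elements in the flattened result,
// exactly as rddExample.flatMap { case (x, y) => List(x, y) } does on the RDD.
val flat = pairs.flatMap { case (x, y) => List(x, y) }

println(flat)  // List(List(1, 2), List(3, 4), List(2, 1), List(4, 2))
```

Because `flatMap` concatenates the per-element lists in order, the two rows of each tuple stay adjacent in the output, matching the row order shown in the DataFrame below.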

To convert it to a DataFrame:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val schema = StructType(StructField("x", IntegerType, true)::
                        StructField("y", IntegerType, true)::Nil)    
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
|  x|  y|
+---+---+
|  1|  2|
|  3|  4|
|  2|  1|
|  4|  2|
+---+---+
