
How to merge / join multiple Spark DataFrames (Scala) efficiently with a full outer join?

How do I merge / join multiple Spark DataFrames (Scala) efficiently? I want to join on a column that is common to all tables, 'Date' below, and get (sort of) a sparse array as a result.

Data Set A:
Date    Col A1   Col A2
-----------------------
1/1/16  A11      A21
1/2/16  A12      A22
1/3/16  A13      A23
1/4/16  A14      A24
1/5/16  A15      A25

Data Set B:
Date    Col B1   Col B2
-----------------------
1/1/16  B11      B21
1/3/16  B13      B23
1/5/16  B15      B25

Data Set C:
Date    Col C1   Col C2
-----------------------
1/2/16  C12      C22
1/3/16  C13      C23
1/4/16  C14      C24
1/5/16  C15      C25

Expected Result Set:
Date    Col A1   Col A2  Col B1  Col B2  Col C1  Col C2
---------------------------------------------------------
1/1/16  A11      A21     B11     B21
1/2/16  A12      A22                     C12     C22
1/3/16  A13      A23     B13     B23     C13     C23
1/4/16  A14      A24                     C14     C24
1/5/16  A15      A25     B15     B25     C15     C25

This feels like a full outer join on multiple tables, but I am not sure. Is there some simpler / more efficient way to get to this sparse array without the join method on DataFrames?
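For reference, the "full outer join" reading of the question would look like the sketch below. It assumes dfa, dfb, and dfc hold the three data sets and share a date column; joining on Seq("date") keeps a single date column in the result.

// Chained full outer joins on the common column. This is the baseline the
// question is asking to improve on; dfa/dfb/dfc are assumed names.
val joined = dfa
  .join(dfb, Seq("date"), "outer")
  .join(dfc, Seq("date"), "outer")
joined.sort("date").show()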

This is an old post, so I'm not sure if the OP is still tuned in. Anyway, a simple way of achieving the desired result is via cogroup(): turn each RDD into a [K, V] RDD with the date as the key, and then cogroup them. Here's an example:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Case classes for the three inputs. They are defined outside the method so
// Spark's reflection-based encoder can find them; field names match the
// schema built below.
case class A(date: String, a1: String, a2: String)
case class B(date: String, b1: String, b2: String)
case class C(date: String, c1: String, c2: String)

def mergeFrames(sc: SparkContext, sqlContext: SQLContext): Unit = {

  import sqlContext.implicits._

  // Create three DataFrames. All string types assumed.
  val dfa = sc.parallelize(Seq(
    A("1/1/16", "A11", "A21"),
    A("1/2/16", "A12", "A22"),
    A("1/3/16", "A13", "A23"),
    A("1/4/16", "A14", "A24"),
    A("1/5/16", "A15", "A25"))).toDF()

  val dfb = sc.parallelize(Seq(
    B("1/1/16", "B11", "B21"),
    B("1/3/16", "B13", "B23"),
    B("1/5/16", "B15", "B25"))).toDF()

  val dfc = sc.parallelize(Seq(
    C("1/2/16", "C12", "C22"),
    C("1/3/16", "C13", "C23"),
    C("1/4/16", "C14", "C24"),
    C("1/5/16", "C15", "C25"))).toDF()

  // Key each RDD by date (column 0); the remaining columns become the value.
  val rdda = dfa.rdd.map(row => row(0) -> row.toSeq.drop(1))
  val rddb = dfb.rdd.map(row => row(0) -> row.toSeq.drop(1))
  val rddc = dfc.rdd.map(row => row(0) -> row.toSeq.drop(1))

  val schema = StructType("date a1 a2 b1 b2 c1 c2".split(" ")
    .map(fieldName => StructField(fieldName, StringType)))

  // Form cogroups. `date` is assumed to be a key, so there is at most one
  // row per date in each RDD/DataFrame; missing dates become null columns.
  val cg: RDD[Row] = rdda.cogroup(rddb, rddc).map { case (k, (v1, v2, v3)) =>
    val cols = Seq(k) ++
      (if (v1.nonEmpty) v1.head else Seq(null, null)) ++
      (if (v2.nonEmpty) v2.head else Seq(null, null)) ++
      (if (v3.nonEmpty) v3.head else Seq(null, null))
    Row.fromSeq(cols)
  }

  // Turn the RDD of Rows back into a DataFrame and sort by date.
  val cgdf = sqlContext.createDataFrame(cg, schema).sort("date")

  cgdf.show()
}
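For completeness, a hypothetical driver for the snippet above (not part of the original answer), assuming the Spark 1.x-era API that the answer uses:

// Assumed setup code: SparkConf/SparkContext/SQLContext are the Spark 1.x
// entry points; Spark 2.x+ would use SparkSession instead.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("mergeFrames").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
mergeFrames(sc, sqlContext)

One reason to prefer cogroup here: it co-partitions all three RDDs on the key at once, while chaining outer joins plans a separate shuffle per pair of tables. The trade-off only holds if each date appears at most once per dataset, as the answer assumes.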
