How to Merge/Join Multiple DataFrames in Spark Scala: Efficient Full Outer Join
How do I efficiently merge/join multiple Spark DataFrames (Scala)? I want to join on a column that all the tables share, "Date" below, and end up with (something like) a sparse array:
Data Set A:
Date Col A1 Col A2
-----------------------
1/1/16 A11 A21
1/2/16 A12 A22
1/3/16 A13 A23
1/4/16 A14 A24
1/5/16 A15 A25
Data Set B:
Date Col B1 Col B2
-----------------------
1/1/16 B11 B21
1/3/16 B13 B23
1/5/16 B15 B25
Data Set C:
Date Col C1 Col C2
-----------------------
1/2/16 C12 C22
1/3/16 C13 C23
1/4/16 C14 C24
1/5/16 C15 C25
Expected Result Set:
Date Col A1 Col A2 Col B1 Col B2 Col C1 Col C2
---------------------------------------------------------
1/1/16 A11 A21 B11 B21
1/2/16 A12 A22 C12 C22
1/3/16 A13 A23 B13 B23 C13 C23
1/4/16 A14 A24 C14 C24
1/5/16 A15 A25 B15 B25 C15 C25
This feels like a full outer join across multiple tables, but I'm not sure. Short of calling a join method on the DataFrames, is there some simpler or more efficient way to get this sparse array?
This is an old post, so I'm not sure whether the OP is still tuned in. Anyway, a simple way to achieve the desired result is via cogroup(): turn each RDD into a [K, V] RDD keyed by date, then use cogroup. Here is an example:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class A(date: String, a1: String, a2: String)
case class B(date: String, b1: String, b2: String)
case class C(date: String, c1: String, c2: String)

def mergeFrames(sc: SparkContext, sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._

  // Create three DataFrames. All columns are assumed to be strings.
  val dfa = sc.parallelize(Seq(
    A("1/1/16", "A11", "A21"),
    A("1/2/16", "A12", "A22"),
    A("1/3/16", "A13", "A23"),
    A("1/4/16", "A14", "A24"),
    A("1/5/16", "A15", "A25"))).toDF()
  val dfb = sc.parallelize(Seq(
    B("1/1/16", "B11", "B21"),
    B("1/3/16", "B13", "B23"),
    B("1/5/16", "B15", "B25"))).toDF()
  val dfc = sc.parallelize(Seq(
    C("1/2/16", "C12", "C22"),
    C("1/3/16", "C13", "C23"),
    C("1/4/16", "C14", "C24"),
    C("1/5/16", "C15", "C25"))).toDF()

  // Key each RDD by the date column, keeping the remaining columns as the value.
  val rdda = dfa.rdd.map(row => row(0) -> row.toSeq.drop(1))
  val rddb = dfb.rdd.map(row => row(0) -> row.toSeq.drop(1))
  val rddc = dfc.rdd.map(row => row(0) -> row.toSeq.drop(1))

  val schema = StructType("date a1 a2 b1 b2 c1 c2".split(" ")
    .map(fieldName => StructField(fieldName, StringType)))

  // Form cogroups. `date` is assumed to be a key, so each date appears at most
  // once per RDD/DataFrame; absent dates are padded with nulls.
  val cg: RDD[Row] = rdda.cogroup(rddb, rddc).map { case (k, (v1, v2, v3)) =>
    val cols = Seq(k) ++
      (if (v1.nonEmpty) v1.head else Seq(null, null)) ++
      (if (v2.nonEmpty) v2.head else Seq(null, null)) ++
      (if (v3.nonEmpty) v3.head else Seq(null, null))
    Row.fromSeq(cols)
  }

  // Turn the RDD back into a DataFrame.
  val cgdf = sqlContext.createDataFrame(cg, schema).sort("date")
  cgdf.show()
}
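For comparison, the "full outer join across multiple tables" the question suspects also works directly on DataFrames: fold over the list of frames, full-outer-joining each one on the shared "date" key, which produces the same sparse result without dropping down to RDDs. This is a hedged sketch using the Spark 2.x `SparkSession` API; the case classes and the `mergeOn` helper are illustrative names, not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring the three data sets above.
case class A(date: String, a1: String, a2: String)
case class B(date: String, b1: String, b2: String)
case class C(date: String, c1: String, c2: String)

object FullOuterMerge {
  // Fold the remaining frames into the first one, full-outer-joining on the
  // shared key. Passing the key as Seq(key) keeps a single "date" column in
  // the output instead of one per input frame.
  def mergeOn(key: String, first: DataFrame, rest: DataFrame*): DataFrame =
    rest.foldLeft(first)((acc, df) => acc.join(df, Seq(key), "full_outer"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("merge").getOrCreate()
    import spark.implicits._

    val dfa = Seq(
      A("1/1/16", "A11", "A21"), A("1/2/16", "A12", "A22"),
      A("1/3/16", "A13", "A23"), A("1/4/16", "A14", "A24"),
      A("1/5/16", "A15", "A25")).toDF()
    val dfb = Seq(
      B("1/1/16", "B11", "B21"), B("1/3/16", "B13", "B23"),
      B("1/5/16", "B15", "B25")).toDF()
    val dfc = Seq(
      C("1/2/16", "C12", "C22"), C("1/3/16", "C13", "C23"),
      C("1/4/16", "C14", "C24"), C("1/5/16", "C15", "C25")).toDF()

    // Rows missing from a frame come back as nulls, giving the sparse layout.
    mergeOn("date", dfa, dfb, dfc).sort("date").show()
    spark.stop()
  }
}
```

Note that chained full outer joins each involve a shuffle, so for many frames on the same key the cogroup approach above (one shuffle over co-partitioned RDDs) can be cheaper.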