3 LEFT JOINs in Spark SQL with the Java API
I have 3 Datasets originating from 3 tables:
Dataset<TABLE1> bbdd_one = map.get("TABLE1").as(Encoders.bean(TABLE1.class)).alias("TABLE1");
Dataset<TABLE2> bbdd_two = map.get("TABLE2").as(Encoders.bean(TABLE2.class)).alias("TABLE2");
Dataset<TABLE3> bbdd_three = map.get("TABLE3").as(Encoders.bean(TABLE3.class)).alias("TABLE3");
and I want to do a triple left join on them and write the result to output.parquet.
The SQL JOIN statement is similar to this:
SELECT one.field, ........, two.field ....., three.field, ... four.field
FROM TABLE1 one
LEFT JOIN TABLE2 two ON two.field = one.field
LEFT JOIN TABLE3 three ON three.field = one.field AND three.field = one.field
LEFT JOIN TABLE3 four ON four.field = one.field AND four.field = one.otherfield
WHERE one.field = 'whatever'
How can I do this with the Java API? Is it possible? I did an example with only one join, but with 3 it seems difficult.
PS: My other join with the Java API is:
Dataset<TJOINED> ds_joined = ds_table1
    .join(ds_table2,
        JavaConversions.asScalaBuffer(Arrays.asList("fieldInCommon1", "fieldInCommon2", "fieldInCommon3", "fieldInCommon4"))
            .seq(),
        "inner")
    .select("a lot of fields", ... "more fields")
    .as(Encoders.bean(TJOINED.class));
Thanks!
Have you tried chaining join statements? I don't often code in Java, so this is just a guess:
Dataset<TJOINED> ds_joined = ds_table1
    .join(ds_table2,
        JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
        "left")
    .join(ds_table3,
        JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
        "left")
    .join(ds_table4,
        JavaConversions.asScalaBuffer(Arrays.asList(...)).seq(),
        "left")
    .select(...)
    .as(Encoders.bean(TJOINED.class));
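Note that joining on a list of shared column names only works when the key columns are named identically on both sides. For a condition like `four.field = one.otherfield` in the original SQL, an explicit join expression is needed instead. A rough sketch with the Java API, where the aliases and field names are placeholders taken from the question rather than real columns:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Alias each side so identically named columns stay distinguishable.
Dataset<Row> joined = ds_table1.alias("one")
    .join(ds_table2.alias("two"),
        col("two.field").equalTo(col("one.field")),
        "left")
    .join(ds_table3.alias("four"),
        col("four.field").equalTo(col("one.field"))
            .and(col("four.field").equalTo(col("one.otherfield"))),
        "left")
    .where(col("one.field").equalTo("whatever"));
```

With an expression join, Spark keeps the key columns from both sides (unlike the shared-name form, which deduplicates them), so in the subsequent `select` you would refer to them through their aliases, e.g. `col("one.field")`.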
Update: If my understanding is correct, ds_table3 and ds_table4 are the same dataset, joined on different fields. Then maybe this updated answer, given in Scala since that's what I'm used to working with, might achieve what you want. Here's the full working example:
import spark.implicits._
case class TABLE1(f1: Int, f2: Int, f3: Int, f4: Int, f5:Int)
case class TABLE2(f1: Int, f2: Int, vTable2: Int)
case class TABLE3(f3: Int, f4: Int, vTable3: Int)
val one = spark.createDataset[TABLE1](Seq(TABLE1(1,2,3,4,5), TABLE1(1,3,4,5,6)))
//one.show()
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//| 1| 2| 3| 4| 5|
//| 1| 3| 4| 5| 6|
//+---+---+---+---+---+
val two = spark.createDataset[TABLE2](Seq(TABLE2(1,2,20)))
//two.show()
//+---+---+-------+
//| f1| f2|vTable2|
//+---+---+-------+
//| 1| 2| 20|
//+---+---+-------+
val three = spark.createDataset[TABLE3](Seq(TABLE3(3,4,20), TABLE3(3,5,50)))
//three.show()
//+---+---+-------+
//| f3| f4|vTable3|
//+---+---+-------+
//| 3| 4| 20|
//| 3| 5| 50|
//+---+---+-------+
val result = one
.join(two, Seq("f1", "f2"), "left")
.join(three, Seq("f3", "f4"), "left")
.join(
three.withColumnRenamed("f4", "f5").withColumnRenamed("vTable3", "vTable4"),
Seq("f3", "f5"),
"left"
)
//result.show()
//+---+---+---+---+---+-------+-------+-------+
//| f3| f5| f4| f1| f2|vTable2|vTable3|vTable4|
//+---+---+---+---+---+-------+-------+-------+
//| 3| 5| 4| 1| 2| 20| 20| 50|
//| 4| 6| 5| 1| 3| null| null| null|
//+---+---+---+---+---+-------+-------+-------+
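For completeness, the same chained left joins can be translated back into the Java API style the question uses. This is a sketch assuming `one`, `two`, and `three` are `Dataset<Row>` counterparts of the datasets above, and it reuses the `JavaConversions` helper from the question (on Scala 2.13+ builds of Spark, `scala.jdk.CollectionConverters` would be used instead):

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import scala.collection.JavaConversions;

Dataset<Row> result = one
    // left join on the shared columns f1, f2
    .join(two, JavaConversions.asScalaBuffer(Arrays.asList("f1", "f2")).seq(), "left")
    // left join on the shared columns f3, f4
    .join(three, JavaConversions.asScalaBuffer(Arrays.asList("f3", "f4")).seq(), "left")
    // rename f4 -> f5 (and the value column) to join the same table on different keys
    .join(three.withColumnRenamed("f4", "f5").withColumnRenamed("vTable3", "vTable4"),
        JavaConversions.asScalaBuffer(Arrays.asList("f3", "f5")).seq(),
        "left");
```

The `withColumnRenamed` trick is what lets the same table play the role of both `three` and `four` from the SQL: renaming the key column changes which column of `one` it matches in the shared-name join.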