Join two dataframes with different records and size in Spark
It seems this issue has been asked a couple of times, but the solutions suggested in previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. Table two (second) was part of table one (first), but after some processing on it I added one more column, column4. Now I want to join these two tables so that I get table three (Required) after joining.
Things I tried.
I tried a couple of different solutions, but none of them works for me.
I tried
val required = first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM"), "left_outer")
I also tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I added .alias after I got an error that says:
Error occured during extract process. Error: org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking the time to read my question.
To generate the wanted result, you should join the two tables on the column(s) that are row-identifying in your first table. Assuming c1 + c2 + c3 uniquely identifies each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
(1, "e", "o"),
(4, "d", "t"),
(3, "f", "e"),
(2, "r", "r"),
(6, "y", "f"),
(5, "t", "g"),
(1, "g", "h"),
(4, "f", "j"),
(6, "d", "k"),
(7, "s", "o")
).toDF("c1", "c2", "c3")
val df2 = Seq(
(3, "f", "e", 444),
(5, "t", "g", 555),
(7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")
df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+
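If you want 0 instead of null for the unmatched rows (which is what the otherwise(lit(0)) in your second attempt seems to aim for), one option is to fill the new column after the join. Note that withColumn on first cannot refer to columns of second unless the two have been joined, which is why that attempt raises the AnalysisException. The following is only a minimal sketch: the local SparkSession, the reduced sample data, and the names keys, keysAreUnique and filled are assumptions made for illustration, and it also checks the uniqueness assumption on the join keys.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("left-join-fill-sketch")
  .getOrCreate()
import spark.implicits._

// Reduced sample data (assumed), same shape as df1 and df2 above.
val df1 = Seq((1, "e", "o"), (3, "f", "e"), (5, "t", "g")).toDF("c1", "c2", "c3")
val df2 = Seq((3, "f", "e", 444), (5, "t", "g", 555)).toDF("c1", "c2", "c3", "c4")

val keys = Seq("c1", "c2", "c3")

// Check the assumption that the key columns identify each row of the first table.
val keysAreUnique =
  df1.select(keys.map(col): _*).distinct().count() == df1.count()

// Left outer join on the keys, then replace null in c4 with 0.
val filled = df1.join(df2, keys, "left_outer").na.fill(0L, Seq("c4"))

filled.show()
```

If keysAreUnique turns out to be false for your real data, a row of the second table can match several rows of the first one, so you would need to add more columns to the key.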