
Perform merge/insert on two spark dataframes with different schemas?

I have two Spark dataframes, DF and DF1, with different schemas.

DF:-

val DF  = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")

+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
|  1| acv| 34|     a|   1|
|  2| fbg| 56|     b|   3|
|  3| rty| 78|     c|   5|
+---+----+---+------+----+

DF1:-

val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
    
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
|  1| gbj| 67|     a|   5|
|  2| gbj| 67|     a|   7|
|  2| jku| 88|     b|   8|
|  4| jku| 88|     b|   7|
|  5| uuu| 12|     c|   9|
+---+----+---+------+----+

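I want to merge DF1 into DF based on the values of id and DBName. If an id and DBName combination already exists in DF, the record should be updated; if it does not exist, a new record should be added. The resulting dataframe should look like this: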

    +---+----+---+------+----+----+
    | id|name|age|DBName|test|col1|
    +---+----+---+------+----+----+
    |  5| uuu| 12|     c|NULL|   9|
    |  2| jku| 88|     b|NULL|   8|
    |  4| jku| 88|     b|NULL|   7|
    |  1| gbj| 67|     a|NULL|   5|
    |  3| rty| 78|     c|   5|NULL|
    |  2| gbj| 67|     a|NULL|   7|
    +---+----+---+------+----+----+

What I have tried so far:

val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" &&  $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)

Error:-

org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;

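You are selecting a column that does not exist (DF1 has no test column, so b.test cannot be resolved), and there is also a typo in the coalesce call: both arguments refer to b, so values from a are never used. You can solve your problem along the lines of the following example: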

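import org.apache.spark.sql.functions.{coalesce, col}
// Assumes spark.implicits._ is in scope (as in spark-shell) for the $"..." syntax.

val updatedDF = DF.as("a").join(
    DF1.as("b"),
    $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
    "outer"
).select(
    // Columns present in both frames: prefer DF1's value, fall back to DF's.
    DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
    :+ col(DF.columns.last)   // "test" exists only in DF
    :+ col(DF1.columns.last)  // "col1" exists only in DF1
    :_*
)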

updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
|  5| uuu| 12|     c|null|   9|
|  2| jku| 88|     b|   3|   8|
|  4| jku| 88|     b|null|   7|
|  1| gbj| 67|     a|   1|   5|
|  3| rty| 78|     c|   5|null|
|  2| gbj| 67|     a|null|   7|
+---+----+---+------+----+----+
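
The answer above relies on each dataframe having exactly one non-shared column, and on that column being the last one. As a rough sketch of how the same idea could be generalized, the snippet below works out the shared and frame-specific columns from the two schemas instead. The names `UpsertSketch` and `upsert` are hypothetical helpers introduced here for illustration, not part of Spark, and the local `SparkSession` setup is an assumption made so that the example is self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object UpsertSketch {
  // Hypothetical helper (not a Spark API): full outer join on `keys`,
  // taking the value from `updates` when present and from `base` otherwise.
  // Assumes `keys` is non-empty and that every key exists in both frames.
  def upsert(base: DataFrame, updates: DataFrame, keys: Seq[String]): DataFrame = {
    val joinCond = keys.map(k => col(s"a.$k") === col(s"b.$k")).reduce(_ && _)

    val baseCols = base.columns.toSet
    val updCols  = updates.columns.toSet
    val shared   = base.columns.filter(updCols)       // in both frames (includes the keys)
    val onlyBase = base.columns.filterNot(updCols)    // only in `base`, e.g. "test"
    val onlyUpd  = updates.columns.filterNot(baseCols) // only in `updates`, e.g. "col1"

    val selected =
      shared.map(c => coalesce(col(s"b.$c"), col(s"a.$c")).as(c)) ++
        onlyBase.map(c => col(s"a.$c")) ++
        onlyUpd.map(c => col(s"b.$c"))

    base.as("a").join(updates.as("b"), joinCond, "outer").select(selected: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("upsert-sketch").getOrCreate()
    import spark.implicits._

    val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5"))
      .toDF("id","name","age","DBName","test")
    val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),
      ("4","jku","88","b","7"),("5","uuu","12","c","9"))
      .toDF("id","name","age","DBName","col1")

    upsert(DF, DF1, Seq("id", "DBName")).show()
    spark.stop()
  }
}
```

For the DF and DF1 from the question this should give the same six rows as the output above, though the row order of a join result is not guaranteed. Note that, as in the answer, a matched row keeps its existing test value from DF; if test should become null on update, as in the question's expected output, that column would need separate handling.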
