
Merging Dataframes in Spark

I have two dataframes, A and B. I want to join them on a key column and create another dataframe. When the key matches in both A and B, I need to take the row from B rather than from A.

For example:

Dataframe A:

Employee1, salary100
Employee2, salary50
Employee3, salary200

Dataframe B:

Employee1, salary150
Employee2, salary100
Employee4, salary300

The resulting dataframe should be:

Dataframe C:

Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300

How can I do this in Spark & Scala?
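
For reference, here is a minimal sketch of how the sample data above could be created. The column names employee and salary, and the use of a local SparkSession (Spark 2.x+), are assumptions, not part of the original question:

import org.apache.spark.sql.SparkSession

// Setup sketch only; adjust column names and session creation to your environment
val spark = SparkSession.builder().appName("merge-example").master("local[*]").getOrCreate()
import spark.implicits._

val dfA = Seq(
  ("Employee1", 100),
  ("Employee2", 50),
  ("Employee3", 200)
).toDF("employee", "salary")

val dfB = Seq(
  ("Employee1", 150),
  ("Employee2", 100),
  ("Employee4", 300)
).toDF("employee", "salary")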

Try:

// Register both dataframes as temp tables so they can be queried with SQL
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")

// Full outer join on the key; coalesce picks B's salary whenever B has a row
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
       coalesce(dfB.salary, dfA.salary)
FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")

Or:

sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
       CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
            ELSE dfA.salary
       END
FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
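
As a side note, registerTempTable is deprecated on Spark 2.x and later; the same full-outer-join-plus-coalesce idea can be written with createOrReplaceTempView and spark.sql. A minimal sketch, assuming a SparkSession named spark and the dfA/dfB shown above:

dfA.createOrReplaceTempView("dfA")
dfB.createOrReplaceTempView("dfB")

// `spark` is an assumed SparkSession, not part of the original answer
val dfC = spark.sql("""
  SELECT coalesce(dfA.employee, dfB.employee) AS employee,
         coalesce(dfB.salary, dfA.salary)     AS salary
  FROM dfA
  FULL OUTER JOIN dfB
    ON dfA.employee = dfB.employee""")

dfC.show()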

Assuming dfA and dfB both have two columns, emp and sal, you can use the following:

import org.apache.spark.sql.{functions => f}
// The 'colName symbol syntax below needs the SQL implicits in scope,
// e.g. import sqlContext.implicits._ (or spark.implicits._ on Spark 2.x)

// Rename dfB's columns so they don't clash with dfA's after the join
val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

// Full outer join on the key, keeping B's values wherever they exist
val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )

Note: each dataframe should have only one row for a given emp value.
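
Putting that answer together into a runnable sketch (the two-column emp/sal schema and the sample values come from the question; the local SparkSession is an assumption for demonstration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => f}

val spark = SparkSession.builder().appName("merge-demo").master("local[*]").getOrCreate()
import spark.implicits._

val dfA = Seq(("Employee1", 100), ("Employee2", 50), ("Employee3", 200)).toDF("emp", "sal")
val dfB = Seq(("Employee1", 150), ("Employee2", 100), ("Employee4", 300)).toDF("emp", "sal")

// Rename dfB's columns, outer join, and prefer B's value whenever it exists
val dfB1 = dfB.withColumnRenamed("sal", "salB").withColumnRenamed("emp", "empB")
val joined = dfA
  .join(dfB1, $"emp" === $"empB", "outer")
  .select(
    f.coalesce($"empB", $"emp").as("emp"),
    f.coalesce($"salB", $"sal").as("sal")
  )

joined.orderBy("emp").show()
// Expected rows: Employee1/150, Employee2/100, Employee3/200, Employee4/300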
