Merging DataFrames in Spark

Question

I have two DataFrames, A and B. I would like to join them on a key column and create another DataFrame. When a key is present in both A and B, I need the row from B, not the one from A.

For example:

DataFrame A:

Employee1, salary100
Employee2, salary50
Employee3, salary200

DataFrame B:

Employee1, salary150
Employee2, salary100
Employee4, salary300

The resulting DataFrame should be:

DataFrame C:

Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300

How can I do this in Spark with Scala?

Answer 1

Try:

dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")

sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee), 
       coalesce(dfB.salary, dfA.salary) FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")

or

sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
  CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
  ELSE dfA.salary
  END FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
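
For readers on Spark 2.0 or later, where `registerTempTable` is deprecated in favour of `createOrReplaceTempView` and `SparkSession` takes the place of `sqlContext`, the first query might be wrapped as in the sketch below. The object name `MergeWithSql`, the local master setting, and the use of integer salaries are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper around the first query above (names are illustrative).
object MergeWithSql {
  // Full outer join on employee; B's salary wins whenever B has the key.
  def merge(spark: SparkSession, dfA: DataFrame, dfB: DataFrame): DataFrame = {
    dfA.createOrReplaceTempView("dfA")
    dfB.createOrReplaceTempView("dfB")
    spark.sql("""
      SELECT coalesce(dfA.employee, dfB.employee) AS employee,
             coalesce(dfB.salary, dfA.salary) AS salary
      FROM dfA FULL OUTER JOIN dfB
      ON dfA.employee = dfB.employee""")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-sql")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, with salaries assumed to be integers.
    val dfA = Seq(("Employee1", 100), ("Employee2", 50), ("Employee3", 200))
      .toDF("employee", "salary")
    val dfB = Seq(("Employee1", 150), ("Employee2", 100), ("Employee4", 300))
      .toDF("employee", "salary")

    merge(spark, dfA, dfB).orderBy("employee").show()
    spark.stop()
  }
}
```

With the sample data, the four rows returned correspond to DataFrame C in the question.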

Answer 2

Assuming dfA and dfB each have two columns, emp and sal, you can use the following:

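import org.apache.spark.sql.{functions => f}
// Outside spark-shell, the 'emp symbol syntax also needs the implicits in scope:
import sqlContext.implicits._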

val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )

Note: each DataFrame should contain at most one row for a given value of emp. Otherwise the outer join produces one output row for every matching pair of rows.
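
If that one-row-per-key assumption is hard to guarantee, one possible alternative (not taken from either answer) is to keep every row of B and append only the rows of A whose key does not appear in B. The sketch below assumes Spark 2.3 or later for `unionByName` and the same `emp` and `sal` column names; the object name `MergeWithAntiJoin` is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical alternative: anti join plus union instead of a full outer join.
object MergeWithAntiJoin {
  // Keep all rows of B, plus the rows of A whose emp has no match in B.
  def merge(dfA: DataFrame, dfB: DataFrame): DataFrame = {
    // "left_anti" returns only the columns of dfA, for rows with no match in dfB.
    val onlyInA = dfA.join(dfB, Seq("emp"), "left_anti")
    // Match columns by name rather than by position.
    dfB.unionByName(onlyInA)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-anti-join")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question, with salaries assumed to be integers.
    val dfA = Seq(("Employee1", 100), ("Employee2", 50), ("Employee3", 200))
      .toDF("emp", "sal")
    val dfB = Seq(("Employee1", 150), ("Employee2", 100), ("Employee4", 300))
      .toDF("emp", "sal")

    merge(dfA, dfB).orderBy("emp").show()
    spark.stop()
  }
}
```

Because no rows are paired across the two inputs, duplicate keys in B are passed through unchanged rather than multiplied against duplicates in A.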
