Merging Dataframes in Spark
I have two DataFrames, say A and B. I would like to join them on a key column and create another DataFrame. When the key matches in both A and B, I need the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B:
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark and Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee) AS employee,
       coalesce(dfB.salary, dfA.salary) AS salary
FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee) AS employee,
       CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
            ELSE dfA.salary
       END AS salary
FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
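The logic of both queries is the same: take the union of keys from both sides, and prefer B's salary whenever B has a row for that key. A minimal plain-Scala sketch of that semantics (using Maps keyed by employee on the question's sample data, no Spark required):

```scala
// Sample data from the question, modeled as Maps keyed by employee.
val dfA = Map("Employee1" -> 100, "Employee2" -> 50, "Employee3" -> 200)
val dfB = Map("Employee1" -> 150, "Employee2" -> 100, "Employee4" -> 300)

// Full outer join = union of keys; coalesce = prefer B's value when present.
val merged = (dfA.keySet ++ dfB.keySet).map { emp =>
  emp -> dfB.getOrElse(emp, dfA(emp))
}.toMap
// Employee1 -> 150, Employee2 -> 100, Employee3 -> 200, Employee4 -> 300
```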
Assuming dfA and dfB have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
import sqlContext.implicits._  // enables the 'colName symbol syntax

// Rename dfB's columns so the joined result is unambiguous.
val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: each DataFrame should contain only one row per value of emp.
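Because each employee appears at most once per DataFrame, the outer-join-plus-coalesce pattern collapses, in plain Scala terms, to Map concatenation, where entries from B override same-key entries from A. A sketch on the question's sample data:

```scala
// Sample data from the question, modeled as Maps keyed by employee.
val dfA = Map("Employee1" -> 100, "Employee2" -> 50, "Employee3" -> 200)
val dfB = Map("Employee1" -> 150, "Employee2" -> 100, "Employee4" -> 300)

// ++ keeps all keys from both Maps; on a key collision the right-hand
// side (dfB) wins, which is exactly "take the row from B when keys match".
val merged = dfA ++ dfB
// Employee1 -> 150, Employee2 -> 100, Employee3 -> 200, Employee4 -> 300
```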