Return combined Dataset after joinWith in Spark Scala

Given the two Spark Datasets below, flights and capitals, what is the most efficient way to return the combined (i.e. "joined") result without first converting to a DataFrame or writing out all the columns by name in a .select() call? I know, for example, that I can access either side of the tuple (e.g. with .map(x => x._1)) or use the * operator:

result.select("_1.*","_2.*")

But the latter may result in duplicate column names, and I'm hoping for a cleaner solution.

Thank you for your help.

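// Needed for toDF and as[...]; assumes a SparkSession named `spark`, as in spark-shell
import spark.implicits._

case class Flights(tripNumber: Int, destination: String)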

case class Capitals(state: String, capital: String)

val flights = Seq(
  (55, "New York"),
  (3, "Georgia"),
  (12, "Oregon")
).toDF("tripNumber","destination").as[Flights]

val capitals = Seq(
  ("New York", "Albany"),
  ("Georgia", "Atlanta"),
  ("Oregon", "Salem")
).toDF("state","capital").as[Capitals]

val result = flights.joinWith(capitals,flights.col("destination")===capitals.col("state"))
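For context, joinWith keeps each side of the join as a whole object, so the result is a Dataset[(Flights, Capitals)] with two struct columns named _1 and _2. The following is a minimal sketch of that shape, assuming a local SparkSession; the session setup and the names `pairs` and `destinations` are illustrative and not part of the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("joinWith-sketch").getOrCreate()
import spark.implicits._

val flights = Seq((55, "New York"), (3, "Georgia"), (12, "Oregon"))
  .toDF("tripNumber", "destination").as[Flights]
val capitals = Seq(("New York", "Albany"), ("Georgia", "Atlanta"), ("Oregon", "Salem"))
  .toDF("state", "capital").as[Capitals]

// joinWith returns typed pairs rather than a flat row
val pairs: Dataset[(Flights, Capitals)] =
  flights.joinWith(capitals, flights("destination") === capitals("state"))

// The schema has exactly two top-level struct columns: _1 (Flights) and _2 (Capitals)
pairs.printSchema()

// Either side can be reached with ordinary typed code
val destinations: Dataset[String] = pairs.map { case (flight, _) => flight.destination }
```

Because both sides stay nested, flattening them with select("_1.*", "_2.*") is what can produce duplicate column names whenever the two case classes share a field name.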

There are two options, but you will have to use join instead of joinWith:

  1. Drop one of the join columns after the join, which is a convenient part of the Dataset API; there is then no need to repeat the projection columns in a select:
     val result = flights.join(capitals,flights("destination")===capitals("state")).drop(capitals("state"))
  2. Rename the join column so that it is the same in both datasets, and specify the join by column name instead:
     val result = flights.join(capitals.withColumnRenamed("state", "destination"), Seq("destination"))

Output (column order as produced by option 2):

result.show
+-----------+----------+-------+
|destination|tripNumber|capital|
+-----------+----------+-------+
|   New York|        55| Albany|
|    Georgia|         3|Atlanta|
|     Oregon|        12|  Salem|
+-----------+----------+-------+
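Note that join returns an untyped DataFrame. If a typed Dataset of the combined record is still wanted, one possibility is sketched below: either map the joinWith pairs into a combined case class, or convert the result of the join back with .as[...]. This is only a sketch under stated assumptions; the case class `FlightWithCapital`, the local session setup, and the names `combined` and `combinedViaJoin` are hypothetical and do not come from the question or the answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Flights(tripNumber: Int, destination: String)
case class Capitals(state: String, capital: String)
// Hypothetical target type holding the fields of both sides without the duplicate key
case class FlightWithCapital(tripNumber: Int, destination: String, capital: String)

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("typed-join-sketch").getOrCreate()
import spark.implicits._

val flights = Seq((55, "New York"), (3, "Georgia"), (12, "Oregon"))
  .toDF("tripNumber", "destination").as[Flights]
val capitals = Seq(("New York", "Albany"), ("Georgia", "Atlanta"), ("Oregon", "Salem"))
  .toDF("state", "capital").as[Capitals]

// Variant A: stay typed throughout and build the combined record from each pair
val combined: Dataset[FlightWithCapital] =
  flights.joinWith(capitals, flights("destination") === capitals("state"))
    .map { case (f, c) => FlightWithCapital(f.tripNumber, f.destination, c.capital) }

// Variant B: untyped join as in option 1, then convert back to a typed Dataset by column name
val combinedViaJoin: Dataset[FlightWithCapital] =
  flights.join(capitals, flights("destination") === capitals("state"))
    .drop(capitals("state"))
    .as[FlightWithCapital]
```

Variant A names the fields once in the constructor call, while variant B relies on the column names of the joined DataFrame matching the fields of the case class, so no explicit select is needed.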
