Dynamically select multiple columns while joining different DataFrames in Scala Spark

Question

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining these two DataFrames? The definition below outputs all columns from df1 and df2 in the case of an inner join.

def joinDF (df1: DataFrame,  df2: DataFrame , joinExprs: Column, joinType: String): DataFrame = {   
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}

Input data:

val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

Expected result:

val dfJoinResult = df1
  .join(df2, df1("id") === df2("id"), "inner")
  .select(df1("type"), df1("account"), df2("value"))

dfJoinResult.schema:

StructType(StructField(type,StringType,true), 
StructField(account,StringType,true), 
StructField(value,StringType,true))

I have looked at options like df.select(cols.head, cols.tail: _*), but that does not allow selecting columns from both DataFrames. Is there a way to pass the selectExpr columns dynamically to my def, along with the DataFrame each column should be selected from? I'm using Spark 2.2.0.

Answer 1

It is possible to pass the select expression as a Seq[Column] to the method:

def joinDF(df1: DataFrame,  df2: DataFrame , joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {   
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr:_*)
}

To call the method, use:

val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))

val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)

This will give the desired result:

+------+-------+-----+
|  type|account|value|
+------+-------+-----+
|   new|current|    7|
|closed| saving|    5|
+------+-------+-----+
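
If the column names are only known as strings at run time, one option is to pass them per dataframe and let the method build the Column objects. The following is a minimal sketch of that idea, not part of the original answer: the helper name joinSelect and its parameters leftCols and rightCols are hypothetical, and the local SparkSession is an assumption made so the snippet can run outside spark-shell.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

// Assumption: a local session for demonstration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-select").getOrCreate()
import spark.implicits._

val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

// Hypothetical helper: column names are given per side, so the caller
// never has to build Column objects by hand.
def joinSelect(left: DataFrame, right: DataFrame, joinExpr: Column, joinType: String,
               leftCols: Seq[String], rightCols: Seq[String]): DataFrame = {
  // Resolve each name against the dataframe it belongs to, which keeps
  // same-named columns (such as "id") unambiguous.
  val columns: Seq[Column] =
    leftCols.map(name => left.col(name)) ++ rightCols.map(name => right.col(name))
  left.join(right, joinExpr, joinType).select(columns: _*)
}

val result = joinSelect(df1, df2, df1("id") === df2("id"), "inner",
  Seq("type", "account"), Seq("value"))
result.show()
```

With the input data from the question, result has the columns type, account and value, in that order.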

In the selectExpr above, it is necessary to specify which dataframe each column comes from. However, this can be simplified further if the following assumptions hold:

The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)

In this case, joinExpr: Column can be changed to joinExpr: Seq[String], and selectExpr: Seq[Column] to selectExpr: Seq[String]:

def joinDF(df1: DataFrame,  df2: DataFrame , joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {   
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail:_*)
}

Calling the method now looks cleaner:

val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")

val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)

Note: When the join is performed using a Seq[String], the columns of the resulting dataframe differ from those produced by an expression join: each join column appears only once instead of twice. If other columns share a name across both dataframes, there is no way to select them separately by name afterwards.
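
The difference in output columns can be checked directly. The snippet below is a small sketch using the input data from the question; the local SparkSession and the value names byName and byExpr are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-columns").getOrCreate()
import spark.implicits._

val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

// Join on a column name: the shared "id" column appears once in the output.
val byName = df1.join(df2, Seq("id"), "inner")

// Join on an expression: the "id" columns of both sides are kept.
val byExpr = df1.join(df2, df1("id") === df2("id"), "inner")

println(byName.columns.mkString(", "))  // id, type, account, value
println(byExpr.columns.mkString(", "))  // id, type, account, id, value
```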

Answer 2

A slight modification of the solution above is to select the required columns from each DataFrame before performing the join. This may have slightly less overhead, since the join then operates on fewer columns.

val dfJoinResult = df1.select("column1","column2").join(df2.select("col1"),joinExpr,joinType)

But remember to include the columns used in the join condition in each select: the select is applied first, and the join can then only use the columns that remain.
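
As a rough sketch of this approach with the data from the question, suppose only type and value are wanted in the output. The join key id is still selected on both sides and dropped after the join. The local SparkSession and the value names left, right and result are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("preselect-join").getOrCreate()
import spark.implicits._

val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

// Keep "id" on both sides even though it is not wanted in the output:
// the join below still needs it.
val left = df1.select("id", "type")
val right = df2.select("id", "value")

// Join on the shared key, then drop it once it is no longer needed.
val result = left.join(right, Seq("id"), "inner").drop("id")
result.show()
```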

Answer 1
3 (accepted) 2018-02-01 02:06:25

Answer 2
0 2019-02-07 18:31:35
