
Issue with Spark-scala Join. Looking for a better approach

I have 2 DFs like below.

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1| 20| 30|
|  2| 30| 40|
+---+---+---+

What would be the best approach to get a new dataframe like the one below? The new DF keeps the same column names c2 and c3, but each value is the concatenation of the corresponding values from both dataframes, i.e. concat(df1("c2"), df2("c2")) under the original column name. I can do this with df3.withColumn("c2_new", concat(df1("c2"), df2("c2"))) and then renaming the new column back to c2. But the issue is that I have 150+ columns in my DF. What would be the best approach here?

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
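For reference, a minimal sketch to reproduce the two input dataframes (assuming a spark-shell session, where spark.implicits._ provides toDF):

import spark.implicits._

val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("M", "c2", "c3")
val df2 = Seq((1, 20, 30), (2, 30, 40)).toDF("M", "c2", "c3")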

If you have a wide dataframe, you can iterate over the columns and apply the same transformation to each of them. In your case you should merge the dataframes and aggregate the columns like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// columns present in both dataframes, excluding the key column "M"
val commonColumns = (df1.columns.toSet & df2.columns.toSet).filter(_ != "M").toSeq

df1.union(df2)
    .groupBy("M")
    .agg(count(lit(1)) as "cnt", // agg needs one fixed Column before the varargs
        commonColumns.map(c => concat_ws("_", collect_set(col(c).cast(StringType))) as c): _*)
    .select("M", commonColumns: _*)
    .show

Here is the output:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|20_2|3_30|
|  2|3_30|40_4|
+---+----+----+
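Note that collect_set does not preserve input order, so the two values within a group can come back in either order; that is why 20_2 and 40_4 appear reversed above.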

If you have a requirement on ordering (i.e. the value from df1 must be on the left side and the value from df2 on the right), you can use this trick:

  1. Add a dataframe number (1 and 2) as a new column before the union
  2. Create a struct from the dataframe number and the column value
  3. During the aggregation, take the min and max of this struct
  4. Extract the value from the struct
  5. Concat the values with an underscore

Code:

import spark.implicits._ // provides the $"..." column syntax in spark-shell

df1
    .withColumn("src", lit(1)) // tag every row with its source dataframe
    .union(df2.withColumn("src", lit(2)))
    .groupBy("M")
    .agg(count(lit(1)) as "cnt",
        commonColumns.map(c => concat(
            min(struct($"src", col(c)))(c), // smallest src wins: the df1 value
            lit("_"),
            max(struct($"src", col(c)))(c)) as c): _*) // largest src wins: the df2 value
    .select("M", commonColumns: _*)
    .show

The final result is ordered:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+

You can do this with a join:

// build one concat expression per non-key column, keeping the original column name
val selectExpr = df1.columns.filterNot(_ == "M").map(c => concat_ws("_", df1(c), df2(c)).as(c))

df1.join(df2, "M")
  .select((col("M") +: selectExpr): _*)
  .show()

gives:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
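Note that this builds a single projection no matter how many columns there are, and df1(c) and df2(c) remain unambiguous after the join because each reference is bound to its source dataframe.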
