
Issue with Spark-scala Join. Looking for a better approach

I have 2 DFs like below.

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+

+---+---+---+
|  M| c2| c3|
+---+---+---+
|  1| 20| 30|
|  2| 30| 40|
+---+---+---+

What would be the best approach to get a new dataframe like the one below? The new DF keeps the same column names c2 and c3, but each value is the concatenation of the corresponding values from both dataframes, i.e. concat(df1("c2"), df2("c2")) under the original column name. I can do this with df3.withColumn("c2_new", concat(df1("c2"), df2("c2"))) and then renaming the new column back to c2. But the issue is that I have 150+ columns in my DF. What would be the best approach here?

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
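For reference, a minimal sketch to reproduce the two input dataframes (assuming a spark-shell session, where spark.implicits._ provides toDF):

import spark.implicits._

val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("M", "c2", "c3")
val df2 = Seq((1, 20, 30), (2, 30, 40)).toDF("M", "c2", "c3")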

If you have a wide dataframe, you can iterate over the columns and apply the same transformation to each of them. In your case you should merge the dataframes and aggregate the columns like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// columns present in both dataframes, excluding the key column "M"
val commonColumns = (df1.columns.toSet & df2.columns.toSet).filter(_ != "M").toSeq

df1.union(df2)
    .groupBy("M")
    .agg(count(lit(1)) as "cnt", // agg needs one fixed Column before the varargs
        commonColumns.map(c => concat_ws("_", collect_set(col(c).cast(StringType))) as c): _*)
    .select("M", commonColumns: _*)
    .show

Here is the output:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|20_2|3_30|
|  2|3_30|40_4|
+---+----+----+
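Note that collect_set does not preserve input order, so the two values within a group can come back in either order; that is why 20_2 and 40_4 appear reversed above.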

If you have a requirement on ordering (i.e. the value from df1 must be on the left side and the value from df2 on the right), you can use this trick:

  1. Add a dataframe number (1 and 2) as a new column before the union
  2. Create a struct from the dataframe number and the column value
  3. During the aggregation, take the min and max of this struct
  4. Extract the value from the struct
  5. Concat the values with an underscore

Code:

import spark.implicits._ // provides the $"..." column syntax in spark-shell

df1
    .withColumn("src", lit(1)) // tag every row with its source dataframe
    .union(df2.withColumn("src", lit(2)))
    .groupBy("M")
    .agg(count(lit(1)) as "cnt",
        commonColumns.map(c => concat(
            min(struct($"src", col(c)))(c), // smallest src wins: the df1 value
            lit("_"),
            max(struct($"src", col(c)))(c)) as c): _*) // largest src wins: the df2 value
    .select("M", commonColumns: _*)
    .show

The final result is ordered:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+

You can do this with a join:

// build one concat expression per non-key column, keeping the original column name
val selectExpr = df1.columns.filterNot(_ == "M").map(c => concat_ws("_", df1(c), df2(c)).as(c))

df1.join(df2, "M")
  .select((col("M") +: selectExpr): _*)
  .show()

gives:

+---+----+----+
|  M|  c2|  c3|
+---+----+----+
|  1|2_20|3_30|
|  2|3_30|4_40|
+---+----+----+
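Note that this builds a single projection no matter how many columns there are, and df1(c) and df2(c) remain unambiguous after the join because each reference is bound to its source dataframe.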
