
Spark Scala dynamic comparing of all similar dataframe columns after JOIN

I am trying to compare columns in a JOINed table / relation that end up with the same names as a result of the JOIN, and which can therefore only be told apart by qualifying them with l. or r.

The code below builds a val to be used with withColumn to compare every column l.x with its counterpart r.x. It does not work, although it runs without a run-time error, as shown below. Basically I want to compare all l.x columns with the r.x columns of the same name.

FULL LISTING

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

val dfCur = sc.parallelize(Seq( 
    (1,"2019-01-01","2018-01-01",1,2),
    (7,"2019-01-01","2019-01-01",100,200),
    (3,"2019-01-01","2019-01-03",5,6)
)).toDF("customer_id", "report_date", "date", "value_1", "value_2")

val dfRaw = sc.parallelize(Seq( 
   (2,"2019-01-01","2019-01-01",1,2),
   (7,"2019-01-01","2019-01-01",100,300),
   (3,"2019-01-01","2019-01-03",5,6)
)).toDF("customer_id", "report_date", "date", "value_1", "value_2")

// contrived and not necessarily correct data, but that is not the focus
val difference = dfCur.union(dfRaw).except(dfCur.intersect(dfRaw))

val prefixL = "l."
val prefixR = "r."

//val diffCols = difference.columns
//    .map(c=> difference(c).as(s"$prefixL$c"))
//    .map(x => when(x =!= s"$prefixR$x", concat(lit(","), lit(x))).otherwise(","))
//    .reduce(concat(_, _))

// Updated attempt, to no avail
val diffCols = difference.columns
    .map(c => (c, difference(c).as(s"$prefixL$c")))
    .map(x => when(x._2 =!= s"${prefixR}${x._1}", concat(lit(","), lit(x._1))).otherwise(","))
    .reduce(concat(_, _))

val result = difference.as("l")
    .join(dfRaw.as("r"), $"l.customer_id" === $"r.customer_id","inner")
    .withColumn("XYZ2", diffCols)

The output looks something like this:

+-----------+-----------+----------+-------+-------+-----------+-----------+----------+-------+-------+-------------------------+
|customer_id|report_date|date      |value_1|value_2|customer_id|report_date|date      |value_1|value_2|XYZ2                     |
+-----------+-----------+----------+-------+-------+-----------+-----------+----------+-------+-------+-------------------------+
|2          |2019-01-01 |2019-01-01|1      |2      |7          |2019-01-01 |2019-01-01|100    |300    |,,date_1,value_1,,       |

UPDATE

I applied the changes suggested in the comments. The code ran, but did not produce correct results.

Thanks for the help. The suggestion was partially correct, but the final clue was to add col(...) like this:

val diffCols2 = difference
    .columns
    .map(c=> (c,difference(c).as(s"${prefixL}${c}")))
    .map(x => when(x._2 =!= col(s"${prefixR}${x._1}"), concat(lit(","), lit(x._1)))
                                            .otherwise(","))
    .reduce(concat(_,_)) 

col(s"${prefixR}${x._1}") is the key point.
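As far as I understand it, the reason is that =!= wraps a plain String argument as a literal value, so without col(...) each column is compared with the text "r.customer_id", "r.value_1" and so on, not with the column of that name. A minimal sketch of the difference follows; the object name ColVsLiteral, the helper compare and the two tiny frames are made up for illustration and are not part of the listing above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColVsLiteral {
  // Hypothetical helper: joins two made-up frames and evaluates both
  // comparisons side by side so the difference is visible per row.
  def compare(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val left  = Seq((1, "a"), (2, "b")).toDF("id", "v").as("l")
    val right = Seq((1, "a"), (2, "z")).toDF("id", "v").as("r")

    left.join(right, col("l.id") === col("r.id"))
      .select(
        col("l.id").as("id"),
        // A bare String becomes a literal: l.v is compared with the text "r.v"
        (col("l.v") =!= "r.v").as("vs_literal"),
        // col(...) resolves the name: l.v is compared with the column r.v
        (col("l.v") =!= col("r.v")).as("vs_column"))
      .orderBy("id")
  }

  def main(args: Array[String]): Unit = {
    // Assumes a local Spark installation; in spark-shell use the existing session
    val spark = SparkSession.builder().master("local[1]").appName("col-vs-literal").getOrCreate()
    compare(spark).show()
    spark.stop()
  }
}
```

With these string values, vs_literal should be true on every row because no value equals the text "r.v", while vs_column is true only where the two sides really differ.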

I noted that mapping successively can change things, and the effect is not always obvious. In any event, this is a way of getting the differences over a JOINed table whose two sides share the same column names.
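For what it is worth, the same comparison can probably be written with a single map over the column names, which avoids the intermediate tuple. This is only a sketch under assumptions: the names JoinedDiff, diffColumns and the output column "changed" are hypothetical, the sample rows are made up, and it swaps in the null-safe <=> operator and concat_ws, which differs from the snippet above (no leading commas, and NULLs on both sides count as equal). It also assumes column names contain no dots.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, not, when}

object JoinedDiff {
  // Hypothetical helper: returns a comma-separated list of the column names
  // whose values differ between the sides aliased `left` and `right`.
  def diffColumns(names: Seq[String], left: String = "l", right: String = "r"): Column = {
    val flags = names.map { c =>
      // when(...) without otherwise yields NULL when the values match;
      // <=> is the null-safe equality test, so NULL vs NULL counts as equal.
      when(not(col(s"$left.$c") <=> col(s"$right.$c")), lit(c))
    }
    // concat_ws skips NULL inputs, so only the differing names are joined
    concat_ws(",", flags: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumes a local Spark installation; in spark-shell use the existing session
    val spark = SparkSession.builder().master("local[1]").appName("joined-diff").getOrCreate()
    import spark.implicits._

    val cur = Seq((7, "2019-01-01", 100, 200), (3, "2019-01-01", 5, 6))
      .toDF("customer_id", "report_date", "value_1", "value_2")
    val raw = Seq((7, "2019-01-01", 100, 300), (3, "2019-01-01", 5, 6))
      .toDF("customer_id", "report_date", "value_1", "value_2")

    cur.as("l")
      .join(raw.as("r"), col("l.customer_id") === col("r.customer_id"), "inner")
      .withColumn("changed", diffColumns(cur.columns.toSeq))
      .show(false)

    spark.stop()
  }
}
```

On this sample data I would expect changed to contain value_2 for customer 7 and an empty string for customer 3.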
