
Spark / Scala - Compare Two Columns In a Dataframe when one is NULL

I'm using Spark (Scala) to QA data movement: moving tables from one relational database to another. The QA process involves executing a full outer join between the source table and the target table.

The source and target tables are joined in a data frame on the key(s):

val joinColumns = for (i <- 0 to (sourceJoinFields.length - 1)) yield sourceDF.col(sourceJoinFields(i)) <=> targetDF.col(targetJoinFields(i))
val joinedDF = sourceDF.join(targetDF, joinColumns.reduce((_&&_)), "fullouter")

I'm using the following logic to find mismatches:

val mismatchColumns = for (i <- 0 to (sourceDF.columns.length-1)) yield (joinedDF.col(joinedDF.columns(i)) =!= joinedDF.col(joinedDF.columns(i+(sourceDF.columns.length))))
val mismatchedDF = joinedDF.filter(mismatchColumns.reduce((_||_)))

However, if a key is missing from one side of the full outer join, a row such as:

+--------------+--------------+--------------+--------------+
|source_key    |source_field  |target_key    |target_field  |
+--------------+--------------+--------------+--------------+
|null          |null          |XXX           |XXX           |
+--------------+--------------+--------------+--------------+

will not be in the mismatchedDF data set.

So my question: is the =!= operator the opposite of the <=> operator? It does not appear to be, so is there an operator that will return FALSE for this case? I can't find much documentation on either operator.

The opposite of IS NOT DISTINCT FROM (<=>) is IS DISTINCT FROM (not(... <=> ...)).

import org.apache.spark.sql.functions.not

val df = Seq(("foo", null), ("foo", "bar"), ("foo", "foo")).toDF("x", "y")
df.select(not($"x" <=> $"y"))

or

df.select(!($"x" <=> $"y"))

or

df.selectExpr("x IS DISTINCT FROM y")

all giving the same result:

+---------------+
|(NOT (x <=> y))|
+---------------+
|           true|
|           true|
|          false|
+---------------+
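The reason =!= does not behave the same way is SQL's three-valued logic: =!= evaluates to NULL when either operand is NULL, and a filter keeps only rows whose predicate is true. A minimal sketch of those semantics in plain Scala (no Spark involved) might look like the following. The object and function names are hypothetical, and `None` stands in for NULL:

```scala
object ThreeValuedModel {
  // None stands in for SQL NULL. This models the semantics; it is not Spark code.

  // Models =!= : the result is NULL (None) as soon as either operand is NULL.
  def notEqual(a: Option[String], b: Option[String]): Option[Boolean] =
    for { x <- a; y <- b } yield x != y

  // Models <=> : always true or false, and NULL <=> NULL is true.
  def nullSafeEqual(a: Option[String], b: Option[String]): Boolean = a == b

  // Models not(a <=> b), i.e. IS DISTINCT FROM.
  def distinctFrom(a: Option[String], b: Option[String]): Boolean =
    !nullSafeEqual(a, b)

  // A filter keeps a row only when the predicate is exactly true.
  def kept(predicate: Option[Boolean]): Boolean = predicate.contains(true)

  def main(args: Array[String]): Unit = {
    val sourceKey: Option[String] = None         // NULL side of the outer join
    val targetKey: Option[String] = Some("XXX")
    println(kept(notEqual(sourceKey, targetKey)))            // false: row is dropped
    println(kept(Some(distinctFrom(sourceKey, targetKey))))  // true: row is kept
  }
}
```

Under this model, a row with NULL on one side is silently dropped by the =!= predicate but flagged by the negated <=> predicate, which matches the behaviour described in the question.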

Of course, if you have a disjunction of negations:

(NOT P) OR (NOT Q)

you can always use De Morgan's laws to rewrite it as the negation of a conjunction:

NOT(P AND Q)

therefore:

not(joinColumns.foldLeft(lit(true))(_ and _))

should work just fine.
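Applied to the mismatch check from the question, the same pattern could be sketched as below. This is an illustration under assumptions, not the asker's actual pipeline: the sample rows, the `anyDistinct` and `mismatches` helpers, and the local SparkSession settings are all made up for the example, and the column pairs are listed by name rather than by position.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, not}

object NullSafeMismatch {
  // Hypothetical helper: true when any paired source/target column differs,
  // treating NULL vs. non-NULL as a difference (NOT (P AND Q AND ...)).
  def anyDistinct(pairs: Seq[(Column, Column)]): Column =
    not(pairs.map { case (s, t) => s <=> t }.foldLeft(lit(true))(_ and _))

  // Builds made-up sample data and returns the rows flagged as mismatched.
  def mismatches(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample rows invented for illustration; not taken from the question.
    val sourceDF = Seq(("k1", "a"), ("k2", "b")).toDF("source_key", "source_field")
    val targetDF = Seq(("k1", "a"), ("k2", "c"), ("k3", "d")).toDF("target_key", "target_field")

    val joinedDF = sourceDF.join(targetDF, $"source_key" <=> $"target_key", "fullouter")

    val pairs: Seq[(Column, Column)] = Seq(
      ($"source_key", $"target_key"),
      ($"source_field", $"target_field"))

    joinedDF.filter(anyDistinct(pairs))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("null-safe-mismatch").getOrCreate()
    mismatches(spark).show()
    spark.stop()
  }
}
```

With this sample data, the filter should keep the k2 row (field values differ) and the k3 row (missing on the source side), while dropping the fully matching k1 row. Building the same predicate with =!= and || would keep only the k2 row.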
