Scala Spark DataFrame columns as map and compare them using foldLeft

What I want to achieve is shown in Image 1. In the data frame there, the first four rows have the correct hash values stored in the corresponding columns ("col_1_hash" holds the hash of "col_1" and "col_2_hash" holds the hash of "col_2"). In row 5 both hash values are wrong (col_1: a, col_2: z, col_1_hash: hash of "z", col_2_hash: hash of "a"). Row 6 has one right and one wrong value (col_1: d, col_2: w, col_1_hash: hash of "d" (correct), col_2_hash: hash of "z" (wrong)).

val totallytemp = xtranwedf.filter(( sha2($"col_1",256)  =!= $"col_1_hash") ||
  (sha2($"col_2",256)  =!= $"col_2_hash"))
val total = totallytemp.count

This gives the output:

total: Long = 2

The result above is what I want to achieve with foldLeft, since there are two records with at least one mismatch.

I know there is an easy way to achieve this, but I don't want to pass hard-coded values.

So I am performing a collect on the data frame to get a list of values and creating a map out of it, as you will see in the second image (Image 2). Here I am passing the map and creating an accumulator, but it doesn't give the answer it should. As Image 1 shows, the answer I want is 2, but this code gives 6.

val templist = "col_1" :: "col_2" :: Nil
val tempmapingList = Map(templist map {s => (s, s + "_hash")} : _*)

val expr: Column = tempmapingList.foldLeft(lit(false)) 
  { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h) 
  }
xtranwedf.filter(expr).count

This gives the output:

total: Long = 6

I want this to be 2. I think it has something to do with the === or = sign, in that it is not creating a new column on which I can perform the count.

The problem with your foldLeft application is that it is not equivalent to the expression you want to use.

As you've said, you're looking for

sha2(b, 256) = b_hash OR sha2(c, 256) = c_hash OR sha2(d, 256) = d_hash

while a chained filter on a DataFrame results in

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash

To achieve the former, you should change the accumulator:

import org.apache.spark.sql.functions.{col, lit, sha2}
import org.apache.spark.sql.Column

val atLeastOneMatch: Column = map.foldLeft(lit(false)) { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) === h) 
}

and then use the result to filter the data

df.filter(atLeastOneMatch).count

This will count all the rows where at least one column matches a hash provided by the map. By De Morgan's laws, its negation

!atLeastOneMatch

will be equivalent to

sha2(b, 256) != b_hash AND sha2(c, 256) != c_hash AND sha2(d, 256) != d_hash

In other words, it will match cases where none of the values matches the corresponding hash.

If you want to find rows where at least one value doesn't match a hash, you should use

sha2(b, 256) != b_hash OR sha2(c, 256) != c_hash OR sha2(d, 256) != d_hash

which can be composed as shown below
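The De Morgan step can be checked outside Spark. The following is a minimal sketch that uses plain Scala collections rather than DataFrame columns; `DeMorganSketch`, `sha256Hex`, the `Row` alias, and the sample rows are hypothetical names and data introduced only for illustration.

```scala
import java.security.MessageDigest

object DeMorganSketch {
  // Hex-encoded SHA-256 of a UTF-8 string (a stand-in for Spark's sha2(col, 256)).
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map(b => "%02x".format(b))
      .mkString

  // A row is modelled as (value, storedHash) pairs, one pair per column.
  type Row = Seq[(String, String)]

  private def matches(pair: (String, String)): Boolean =
    sha256Hex(pair._1) == pair._2

  // "There exists a column whose hash matches."
  def atLeastOneMatch(row: Row): Boolean = row.exists(matches)

  // "Every column's hash mismatches."
  def noneMatch(row: Row): Boolean = row.forall(p => !matches(p))

  def main(args: Array[String]): Unit = {
    val bothWrong: Row = Seq("a" -> sha256Hex("z"), "z" -> sha256Hex("a"))
    val oneWrong: Row  = Seq("d" -> sha256Hex("d"), "w" -> sha256Hex("z"))
    // Negating the "exists" form gives the "forall mismatch" form for every row.
    for (row <- Seq(bothWrong, oneWrong))
      assert(!atLeastOneMatch(row) == noneMatch(row))
  }
}
```

The same identity is what lets `!atLeastOneMatch` stand in for the all-mismatch condition in the Spark expression.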

val atLeastOneMismatch: Column = map.foldLeft(lit(false)) { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h) 
}

Its negation

!atLeastOneMismatch

is equivalent (De Morgan's laws once again) to

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash

and is further equivalent to foldLeft with a DataFrame accumulator and ===.

To summarize, if C is a set of columns, then:

  • ∃c∈C map(c) = sha2(c, 256) - atLeastOneMatch
  • ∀c∈C map(c) != sha2(c, 256) - !atLeastOneMatch
  • ∃c∈C map(c) != sha2(c, 256) - atLeastOneMismatch
  • ∀c∈C map(c) = sha2(c, 256) - !atLeastOneMismatch
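The DataFrame-accumulator form mentioned above can be sketched as follows. This is only an illustration, and it assumes the map values are names of hash columns (so they are wrapped in `col`); `ChainedFilterSketch` and `allMatch` are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

object ChainedFilterSketch {
  // Each step filters the result of the previous step,
  // so the per-column conditions end up combined with AND.
  def allMatch(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) {
      case (acc, (c, h)) => acc.filter(sha2(col(c), 256) === col(h))
    }
}
```

Because every `filter` call narrows the previous result, a row survives only if all of its hashes match, which is why this form cannot express the OR condition.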
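Applied to the data described in the question, the `atLeastOneMismatch` case might look like the sketch below. It assumes the map values are column names ("col_1" -> "col_1_hash"), as in the question's code. In that case the right-hand side has to be wrapped in `col`, because Spark treats a plain String passed to `===` or `=!=` as a literal value, not as a column reference. The object name, the `sampleFrame` helper, the `src_*` helper columns, and the values in the first four rows are hypothetical; only rows 5 and 6 follow the question's description.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sha2}

object HashMismatchSketch {
  // OR-fold over (value column -> hash column) pairs.
  // col(h) makes the comparison column-to-column instead of column-to-literal.
  def atLeastOneMismatch(mapping: Map[String, String]): Column =
    mapping.foldLeft(lit(false)) {
      case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= col(h))
    }

  // Sample data: src_1 / src_2 are the strings whose hashes get stored.
  def sampleFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("a", "b", "a", "b"), // rows 1-4: both hashes correct (made-up values)
      ("c", "d", "c", "d"),
      ("e", "f", "e", "f"),
      ("g", "h", "g", "h"),
      ("a", "z", "z", "a"), // row 5: both hashes wrong
      ("d", "w", "d", "z")  // row 6: col_2_hash wrong
    ).toDF("col_1", "col_2", "src_1", "src_2")
      .withColumn("col_1_hash", sha2(col("src_1"), 256))
      .withColumn("col_2_hash", sha2(col("src_2"), 256))
      .drop("src_1", "src_2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("hash-mismatch-sketch")
      .getOrCreate()

    // Build the mapping from a list of column names instead of hard-coding pairs.
    val mapping = List("col_1", "col_2").map(c => c -> s"${c}_hash").toMap

    val df = sampleFrame(spark)
    println(df.filter(atLeastOneMismatch(mapping)).count()) // prints 2
    spark.stop()
  }
}
```

Without `col(h)`, each hash would be compared with the literal string "col_1_hash" or "col_2_hash", so every row would count as a mismatch, which is consistent with the count of 6 reported in the question.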
