Scala: pass DataFrame columns as a Map and compare them using foldLeft

Question

What I want to achieve is shown in Image 1. In the first image I have a DataFrame whose first 4 rows have the correct hash values stored in the corresponding columns ("col_1_hash" holds the hash of "col_1" and "col_2_hash" holds the hash of "col_2"). In row 5 both hash values are wrong (col_1: a, col_2: z, col_1_hash: hash of "z", col_2_hash: hash of "a"). Row 6 has one right and one wrong value (col_1: d, col_2: w, col_1_hash: hash of "d" (correct), col_2_hash: hash of "z" (wrong)).

val totallytemp = xtranwedf.filter(( sha2($"col_1",256)  =!= $"col_1_hash") ||
  (sha2($"col_2",256)  =!= $"col_2_hash"))
val total = totallytemp.count

This gives the output:

total: Long = 2

The result above is what I want to achieve with foldLeft, since there are two records where at least one hash does not match.

I know there is an easy way to achieve this, but I don't want to pass hard-coded values.

So I am performing a collect on the DataFrame, getting a list of values, and creating a map out of it, as you will see in the second image (Image 2). Here I am passing the map and creating an accumulator, but it doesn't give the answer it should. As you can see in Image 1, the answer I want is 2, but this code gives 6.

val templist = "col_1" :: "col_2" :: Nil
val tempmapingList = Map(templist map {s => (s, s + "_hash")} : _*)

val expr: Column = tempmapingList.foldLeft(lit(false)) 
  { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h) 
  }
xtranwedf.filter(expr).count

This gives the output:

total: Long = 6

I want this to be 2. I think it has something to do with the === or = sign, where it is not creating a new column on which I can perform the count.

Answer 1

The problem with your foldLeft application is that it is not equivalent to the expression you want to use.

As you've said, you're looking for

sha2(b, 256) = b_hash OR sha2(c, 256) = c_hash OR sha2(d, 256) = d_hash

while a chained filter on a DataFrame results in

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash

To achieve the former, you should change the accumulator:

import org.apache.spark.sql.functions.{col, lit, sha2}
import org.apache.spark.sql.Column

val atLeastOneMatch: Column = map.foldLeft(lit(false)) { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) === h) 
}

and then use the result to filter the data

df.filter(atLeastOneMatch).count

This will count all the rows where at least one column matches a hash provided by the map. By De Morgan's laws, its negation

!atLeastOneMatch

will be equivalent to

sha2(b, 256) != b_hash AND sha2(c, 256) != c_hash AND sha2(d, 256) != d_hash

In other words, it will match cases where none of the values matches the corresponding hash.

If you want to find rows where at least one value doesn't match a hash, you should use

sha2(b, 256) != b_hash OR sha2(c, 256) != c_hash OR sha2(d, 256) != d_hash

which can be composed as shown below

val atLeastOneMismatch: Column = map.foldLeft(lit(false)) { 
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h) 
}
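
One detail worth noting: when a plain String is passed to === or =!=, Spark compares against a string literal, not against another column. If the map values are the names of the hash columns, as in the question's tempmapingList, the right-hand side needs to be wrapped in col(...). The following is a minimal sketch under that assumption; the local SparkSession, the sample rows, and the names df, mapping, src_1, and src_2 are made up for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sha2}

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[1]").appName("hash-check").getOrCreate()
import spark.implicits._

// Hypothetical sample data: src_1/src_2 are the values the stored hashes are computed from.
val df = Seq(
  ("a", "b", "a", "b"), // both hashes correct
  ("a", "z", "z", "a"), // both hashes wrong
  ("d", "w", "d", "z")  // one correct, one wrong
).toDF("col_1", "col_2", "src_1", "src_2")
  .withColumn("col_1_hash", sha2(col("src_1"), 256))
  .withColumn("col_2_hash", sha2(col("src_2"), 256))
  .drop("src_1", "src_2")

// Map each value column to the name of its hash column.
val mapping: Map[String, String] = Seq("col_1", "col_2").map(c => c -> s"${c}_hash").toMap

// col(h) makes the right-hand side a column reference instead of a string literal.
val atLeastOneMismatch: Column = mapping.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= col(h))
}

val mismatches = df.filter(atLeastOneMismatch).count() // 2 for this sample
```

With a bare String on the right-hand side, every hash would be compared to the text "col_1_hash" or "col_2_hash", so every row would count as a mismatch.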

Its negation

!atLeastOneMismatch

is equivalent (De Morgan's laws once again) to

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash

and is further equivalent to foldLeft with a DataFrame accumulator and ===.
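
For comparison, a fold that uses the DataFrame itself as the accumulator and applies one filter per column could be sketched as below. The function name allHashesMatch and the mapping parameter (value-column name to hash-column name) are assumptions for illustration. Each chained filter narrows the previous result, which is why the conditions end up combined with AND.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

// Hypothetical helper: keeps only rows where every column matches its stored hash.
// mapping: value-column name -> hash-column name (an assumption about the map's shape).
def allHashesMatch(df: DataFrame, mapping: Map[String, String]): DataFrame =
  mapping.foldLeft(df) {
    // Each step filters the already-filtered DataFrame, so the conditions are ANDed.
    case (acc, (c, h)) => acc.filter(sha2(col(c), 256) === col(h))
  }
```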

So, to summarize: if C is a set of columns, then:

∃c∈C map(c) = sha2(c, 256) - atLeastOneMatch
∀c∈C map(c) != sha2(c, 256) - !atLeastOneMatch
∃c∈C map(c) != sha2(c, 256) - atLeastOneMismatch
∀c∈C map(c) = sha2(c, 256) - !atLeastOneMismatch
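
The four cases can be checked for a single row with plain Scala collections, without Spark. In this sketch, matches is a hypothetical list holding, for each column c, whether map(c) equals sha2(c, 256); exists plays the role of ∃ and forall the role of ∀.

```scala
// Hypothetical per-column results for one row: true means the hash matches.
val matches: Seq[Boolean] = Seq(true, false)

val atLeastOneMatch: Boolean    = matches.exists(identity) // ∃c: match
val atLeastOneMismatch: Boolean = matches.exists(m => !m)  // ∃c: mismatch

// De Morgan: negating "exists" gives "forall" over the negated condition.
val noneMatch: Boolean = matches.forall(m => !m)  // same as !atLeastOneMatch
val allMatch: Boolean  = matches.forall(identity) // same as !atLeastOneMismatch
```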

Answer 1
0 2019-03-06 22:24:19
