Scala Spark DataFrame columns as map and compare them using foldLeft
What I want to achieve is shown in Image 1. In the first image I have a DataFrame in which the first 4 rows have the correct hash values stored in the corresponding columns ("col_1_hash" holds the hash of "col_1" and "col_2_hash" holds the hash of "col_2"). In row 5 both hash values are wrong (col_1: a, col_2: z, col_1_hash holds the hash of "z", col_2_hash holds the hash of "a"). Row 6 has one right and one wrong value (col_1: d, col_2: w, col_1_hash holds the hash of "d" (correct), col_2_hash holds the hash of "z" (wrong)).
val totallytemp = xtranwedf.filter(( sha2($"col_1",256) =!= $"col_1_hash") ||
(sha2($"col_2",256) =!= $"col_2_hash"))
val total = totallytemp.count
This gives the output:
total: Long = 2
The result above is what I want to achieve with foldLeft, since there are two records in which at least one hash does not match.
I know there is an easy way to achieve this, but I don't want to pass hard-coded values.
So I am performing a collect on the DataFrame, getting a list of values, and creating a map out of it, as you can see in the second image (Image 2). There I am passing the map and creating an accumulator, but it doesn't give the answer it should. As you can see in Image 1, the answer I want is 2, but this code gives 6.
val templist = "col_1" :: "col_2" :: Nil
val tempmapingList = Map(templist map {s => (s, s + "_hash")} : _*)
val expr: Column = tempmapingList.foldLeft(lit(false))
{
case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h)
}
xtranwedf.filter(expr).count
This gives the output:
total: Long = 6
I want this to be 2. I think it has something to do with the === or = sign, where it is not creating a new column on which I can perform the count.
The problem with your foldLeft application is that it is not equivalent to the expression you want to use.
As you've said, you're looking for
sha2(b, 256) = b_hash OR sha2(c, 256) = c_hash OR sha2(d, 256) = d_hash
while a chained filter on a DataFrame results in
sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash
To achieve the former, you should change the accumulator:
import org.apache.spark.sql.functions.{col, lit, sha2}
import org.apache.spark.sql.Column
val atLeastOneMatch: Column = map.foldLeft(lit(false)) {
case (acc, (c, h)) => acc or (sha2(col(c), 256) === h)
}
and then use the result to filter the data:
df.filter(atLeastOneMatch).count
This will count all the rows where at least one column matches a hash provided by the map. By De Morgan's laws, its negation
!atLeastOneMatch
will be equivalent to
sha2(b, 256) != b_hash AND sha2(c, 256) != c_hash AND sha2(d, 256) != d_hash
In other words, it will match cases where none of the values matches the corresponding hash.
If you want to find rows where at least one value doesn't match a hash, you should use
sha2(b, 256) != b_hash OR sha2(c, 256) != c_hash OR sha2(d, 256) != d_hash
which can be composed as shown below:
val atLeastOneMismatch: Column = map.foldLeft(lit(false)) {
case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h)
}
Its negation
!atLeastOneMismatch
is equivalent to (De Morgan's laws once again)
sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash
and further equivalent to foldLeft with a DataFrame accumulator and ===.
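One likely reason the version in the question counts all 6 rows: when `=!=` or `===` is given a plain `String`, Spark treats it as a string literal, not as a column name, so `sha2(col(c), 256) =!= h` compares every hash with the text "col_1_hash" and is always true. Below is a minimal sketch of the column-name variant, assuming the map values are names of hash columns (as in the question) rather than hash values. The local `SparkSession`, the sample rows, and the names `hashColumns` and `src_*` are made up for illustration and only mirror the description of Image 1.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sha2}

// Assumption: a throwaway local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[*]").appName("hash-check").getOrCreate()
import spark.implicits._

// Hypothetical data: (col_1, col_2, value hashed into col_1_hash, value hashed into col_2_hash).
val df = Seq(
  ("a", "b", "a", "b"), ("b", "c", "b", "c"),
  ("c", "d", "c", "d"), ("d", "e", "d", "e"),
  ("a", "z", "z", "a"), // row 5: both hashes wrong
  ("d", "w", "d", "z")  // row 6: col_2_hash wrong
).toDF("col_1", "col_2", "src_1", "src_2")
  .withColumn("col_1_hash", sha2(col("src_1"), 256))
  .withColumn("col_2_hash", sha2(col("src_2"), 256))
  .drop("src_1", "src_2")

// Column name -> name of the column holding its hash.
val hashColumns: Map[String, String] =
  Seq("col_1", "col_2").map(c => c -> s"${c}_hash").toMap

// col(h) makes the right-hand side a column reference instead of a string literal.
val atLeastOneMismatch: Column = hashColumns.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= col(h))
}
val total = df.filter(atLeastOneMismatch).count()

// Same fold with a bare String: every row "mismatches" the literal text.
val comparedToLiteral: Column = hashColumns.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h)
}
val literalTotal = df.filter(comparedToLiteral).count()

// DataFrame accumulator: chained filters are combined with AND.
val allMatchChained = hashColumns.foldLeft(df) {
  case (acc, (c, h)) => acc.filter(sha2(col(c), 256) === col(h))
}.count()
val allMatchNegated = df.filter(!atLeastOneMismatch).count()
```

On this sample data `total` is 2 and `literalTotal` is 6, while `allMatchChained` and `allMatchNegated` both count the 4 fully consistent rows. If the map really does hold hash values rather than column names, the bare `h` form is the appropriate one.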
To summarize: if C is a set of columns, then:
atLeastOneMatch: sha2(c, 256) = c_hash for at least one c in C
!atLeastOneMatch: sha2(c, 256) != c_hash for every c in C
atLeastOneMismatch: sha2(c, 256) != c_hash for at least one c in C
!atLeastOneMismatch: sha2(c, 256) = c_hash for every c in C
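The relationship between the four predicates can also be checked without Spark. The following is only an illustrative sketch in plain Scala: the helper `sha256Hex`, the `HashedRow` alias, and the three sample rows are assumptions, with each row modelled as a list of (value, stored hash) pairs.

```scala
import java.security.MessageDigest

// Hypothetical helper: lower-case hex SHA-256 digest of a string.
def sha256Hex(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

type HashedRow = Seq[(String, String)] // (value, stored hash) per column

val rows: Seq[HashedRow] = Seq(
  Seq("a" -> sha256Hex("a"), "b" -> sha256Hex("b")), // every hash matches
  Seq("a" -> sha256Hex("z"), "z" -> sha256Hex("a")), // no hash matches
  Seq("d" -> sha256Hex("d"), "w" -> sha256Hex("z"))  // one match, one mismatch
)

// Same shape as the Column folds: start from false and OR the per-column checks.
def atLeastOneMatch(row: HashedRow): Boolean =
  row.foldLeft(false) { case (acc, (v, h)) => acc || sha256Hex(v) == h }

def atLeastOneMismatch(row: HashedRow): Boolean =
  row.foldLeft(false) { case (acc, (v, h)) => acc || sha256Hex(v) != h }

// One (atLeastOneMatch, atLeastOneMismatch) pair per row.
val summary: Seq[(Boolean, Boolean)] =
  rows.map(r => (atLeastOneMatch(r), atLeastOneMismatch(r)))

// De Morgan: "not (some match)" is the same as "all mismatch", and vice versa.
val deMorganHolds: Boolean = rows.forall { r =>
  (!atLeastOneMatch(r) == r.forall { case (v, h) => sha256Hex(v) != h }) &&
  (!atLeastOneMismatch(r) == r.forall { case (v, h) => sha256Hex(v) == h })
}
```

Only the mixed row satisfies both `atLeastOneMatch` and `atLeastOneMismatch`, which is why the two predicates are not negations of each other.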