All I want to achieve is this: I have a DataFrame in which the first 4 rows have correct hash values stored in the corresponding columns ("col_1_hash" holds the hash of "col_1" and "col_2_hash" holds the hash of "col_2"). In row 5 both hashes are wrong (col_1: a, col_2: z, col_1_hash: hash of "z", col_2_hash: hash of "a"), and row 6 has one right and one wrong value (col_1: d, col_2: w, col_1_hash: hash of "d" (correct), col_2_hash: hash of "z" (wrong)).
val totallytemp = xtranwedf.filter(
  (sha2($"col_1", 256) =!= $"col_1_hash") ||
  (sha2($"col_2", 256) =!= $"col_2_hash"))
val total = totallytemp.count
this will give output:
total: Long = 2
The result above is what I want to achieve with foldLeft, as there are two records where at least one hash doesn't match.
Now, I know there is an easy way to achieve this, but I don't want to pass hard-coded values.
So I perform a collect on the DataFrame, get a list of column names, and create a map out of it. I then pass that map and build an accumulator, but it doesn't give the answer it should: as shown above, the answer I want is 2, but this code gives 6.
val templist = "col_1" :: "col_2" :: Nil
val tempmapingList = Map(templist map { s => (s, s + "_hash") }: _*)

val expr: Column = tempmapingList.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h)
}

xtranwedf.filter(expr).count
this gives output:
total: Long = 6
I want the result here to be 2, but I think it has something to do with the === or =!= comparison, where it is not comparing against the column I expect before I perform the count.
The problem with your foldLeft application is that it is not equivalent to the expression you want to use. Note also that h is a String, so sha2(col(c), 256) =!= h compares the computed hash against the literal column name, not against the hash column itself; the right-hand side should be col(h).
As you've said you're looking for
sha2(b, 256) = b_hash OR sha2(c, 256) = c_hash OR sha2(d, 256) = d_hash
while a chained filter on a DataFrame
results in
sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash
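The AND-ing behaviour of chained filters versus a single disjunction can be seen with plain Scala collections (an analogy only, not Spark itself; the predicates here are made up):

```scala
val rows = List(1, 2, 3, 4, 5, 6)
val p1 = (x: Int) => x % 2 == 0   // stand-in for the first hash check
val p2 = (x: Int) => x > 3        // stand-in for the second hash check

// Chained filters keep only rows satisfying BOTH predicates (AND):
val chained = rows.filter(p1).filter(p2)          // List(4, 6)

// A single filter over a disjunction keeps rows satisfying EITHER (OR):
val either = rows.filter(x => p1(x) || p2(x))     // List(2, 4, 5, 6)
```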
To achieve the former, you should change the accumulator:
import org.apache.spark.sql.functions.{col, lit, sha2}
import org.apache.spark.sql.Column

// map: column name -> name of the corresponding hash column
val atLeastOneMatch: Column = map.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) === col(h))
}
and then use the result to filter the data
df.filter(atLeastOneMatch).count
This will count all the rows where at least one column matches the hash provided by the map. By De Morgan's laws, its negation
!atLeastOneMatch
will be equivalent to
sha2(b, 256) != b_hash AND sha2(c, 256) != c_hash AND sha2(d, 256) != d_hash
In other words, it will match cases where none of the values matches its corresponding hash.
If you want to find rows where at least one value doesn't match a hash, you should use
sha2(b, 256) != b_hash OR sha2(c, 256) != c_hash OR sha2(d, 256) != d_hash
which can be composed as shown below
val atLeastOneMismatch: Column = map.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= col(h))
}
Its negation
!atLeastOneMismatch
is equivalent (De Morgan's laws once again) to
sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash
and further equivalent to a foldLeft with the DataFrame itself as the accumulator and === (each step chains another filter, so the predicates are AND-ed together).
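That collection-as-accumulator pattern can be sketched with plain Scala lists (an analogy for the DataFrame case, with made-up predicates standing in for the hash checks):

```scala
// Stand-ins for the per-column hash-match predicates:
val predicates = List[Int => Boolean](_ % 2 == 0, _ > 3)
val rows = List(1, 2, 3, 4, 5, 6)

// Folding with the collection as the accumulator: each step applies
// another filter, so the predicates are effectively AND-ed together.
val allPass = predicates.foldLeft(rows) { (acc, p) => acc.filter(p) }
// allPass == List(4, 6)
```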
So to summarize - if C is the set of hashed columns, then:
atLeastOneMatch - true when sha2(c, 256) = c_hash for at least one c in C
!atLeastOneMatch - true when sha2(c, 256) != c_hash for every c in C
atLeastOneMismatch - true when sha2(c, 256) != c_hash for at least one c in C
!atLeastOneMismatch - true when sha2(c, 256) = c_hash for every c in C
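All four predicates can be checked end to end with plain Scala, using java.security.MessageDigest in place of Spark's sha2. The shape of the data mirrors the question (rows 5 and 6 are the bad ones); the values in the first four rows are made up:

```scala
import java.security.MessageDigest

// Hex-encoded SHA-256, matching the output format of Spark's sha2(_, 256)
def sha2(s: String): String =
  MessageDigest.getInstance("SHA-256")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_)).mkString

// Rows as (col_1, col_2, col_1_hash, col_2_hash):
val rows = List(
  ("a", "x", sha2("a"), sha2("x")),  // both hashes correct
  ("b", "y", sha2("b"), sha2("y")),  // both correct
  ("c", "v", sha2("c"), sha2("v")),  // both correct
  ("e", "u", sha2("e"), sha2("u")),  // both correct
  ("a", "z", sha2("z"), sha2("a")),  // row 5: both wrong (swapped)
  ("d", "w", sha2("d"), sha2("z"))   // row 6: col_1 correct, col_2 wrong
)

// (value, expected hash) pairs per row, analogous to the column -> hash map
def pairs(r: (String, String, String, String)) =
  List((r._1, r._3), (r._2, r._4))

def atLeastOneMatch(r: (String, String, String, String)): Boolean =
  pairs(r).foldLeft(false) { case (acc, (v, h)) => acc || sha2(v) == h }

def atLeastOneMismatch(r: (String, String, String, String)): Boolean =
  pairs(r).foldLeft(false) { case (acc, (v, h)) => acc || sha2(v) != h }

println(rows.count(atLeastOneMismatch))          // 2 (rows 5 and 6)
println(rows.count(r => !atLeastOneMatch(r)))    // 1 (row 5: nothing matches)
println(rows.count(r => !atLeastOneMismatch(r))) // 4 (rows 1-4: everything matches)
```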