
Scala Spark: DataFrame columns as a map, compared using foldLeft

All I want to achieve is the following (see Image 1). I have a DataFrame in which the first 4 rows have correct hash values stored in the corresponding columns ("col_1_hash" holds the hash of "col_1" and "col_2_hash" holds the hash of "col_2"). In row 5 both hashes are wrong (col_1: a, col_2: z, but col_1_hash holds the hash of "z" and col_2_hash holds the hash of "a"), and row 6 has one right and one wrong value (col_1: d, col_2: w, col_1_hash holds the hash of "d" (correct), col_2_hash holds the hash of "z" (wrong)).
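Since the images don't reproduce here, here is a minimal sketch of that DataFrame (the values in the first four rows are made up; only rows 5 and 6 follow the exact description):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sha2

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Rows 1-4 hash the right value; row 5 has both hashes swapped;
// row 6 has col_1_hash correct and col_2_hash wrong.
val xtranwedf = Seq(
  ("a", "x", "a", "x"),
  ("b", "y", "b", "y"),
  ("c", "z", "c", "z"),
  ("d", "w", "d", "w"),
  ("a", "z", "z", "a"),
  ("d", "w", "d", "z")
).toDF("col_1", "col_2", "src_1", "src_2")
  .withColumn("col_1_hash", sha2($"src_1", 256))
  .withColumn("col_2_hash", sha2($"src_2", 256))
  .drop("src_1", "src_2")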

val totallytemp = xtranwedf.filter(
  (sha2($"col_1", 256) =!= $"col_1_hash") ||
  (sha2($"col_2", 256) =!= $"col_2_hash"))
val total = totallytemp.count

This gives the output:

total: Long = 2

The above result is what I want to achieve with foldLeft, since there are two records where at least one hash doesn't match.

Now, I know there is an easy way to achieve this, but I don't want to pass hard-coded values.

So I perform a collect on the DataFrame to get a list of values and build a map out of it (see Image 2). I pass the map in and build up an accumulator, but it doesn't give the answer it should: as shown in Image 1, the answer I want is 2, but this code gives 6.

val templist = "col_1" :: "col_2" :: Nil
val tempmapingList = Map(templist map {s => (s, s + "_hash")} : _*)
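For reference, with the two columns above the resulting map is:

// tempmapingList == Map("col_1" -> "col_1_hash", "col_2" -> "col_2_hash")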

val expr: Column = tempmapingList.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= h)
}
xtranwedf.filter(expr).count

This gives the output:

total: Long = 6

I want this to be 2, but I think it has something to do with the === or =!= sign, where it is not creating a new column on which I can perform the count.

The problem with your foldLeft application is that it is not equivalent to the expression you want to use.

As you've said, you're looking for

sha2(b, 256) = b_hash OR sha2(c, 256) = c_hash OR sha2(d, 256) = d_hash

while a chained filter on a DataFrame results in

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash
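As a sketch of that behaviour (with df standing in for any DataFrame carrying these columns), chaining filter calls restricts the result step by step, which combines the predicates with AND:

// Each filter further restricts the previous result...
df.filter(sha2($"col_1", 256) === $"col_1_hash")
  .filter(sha2($"col_2", 256) === $"col_2_hash")

// ...which is the same as a single conjunctive filter:
df.filter(
  (sha2($"col_1", 256) === $"col_1_hash") &&
  (sha2($"col_2", 256) === $"col_2_hash"))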

To achieve the former, you should change the accumulator:

import org.apache.spark.sql.functions.{col, lit, sha2}
import org.apache.spark.sql.Column

// map pairs each column name with its hash column name
// (tempmapingList in the question). Since h is a column *name*,
// it has to be wrapped in col(h); a bare string h would be
// treated as a literal value.
val atLeastOneMatch: Column = map.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) === col(h))
}

and then use the result to filter the data

df.filter(atLeastOneMatch).count

This will count all the rows where at least one column matches the hash stored in its corresponding hash column. By De Morgan's laws, its negation

!atLeastOneMatch

will be equivalent to

sha2(b, 256) != b_hash AND sha2(c, 256) != c_hash AND sha2(d, 256) != d_hash

In other words, it will match rows where none of the values matches its corresponding hash.

If you want to find rows where at least one value doesn't match its hash, you should use

sha2(b, 256) != b_hash OR sha2(c, 256) != c_hash OR sha2(d, 256) != d_hash

which can be composed as shown below

val atLeastOneMismatch: Column = map.foldLeft(lit(false)) {
  case (acc, (c, h)) => acc or (sha2(col(c), 256) =!= col(h))  // again col(h), not the bare string h
}
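Note that h must again be wrapped in col(h). This is also what went wrong in the original attempt: with =!= h the string h is treated as a literal, so every row's hash differs from the text "col_1_hash" and all 6 rows pass the filter. With col(h) the count matches the hard-coded version:

xtranwedf.filter(atLeastOneMismatch).count  // 2 on the sample data: rows 5 and 6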

Its negation

!atLeastOneMismatch

is equivalent (by De Morgan's laws once again) to

sha2(b, 256) = b_hash AND sha2(c, 256) = c_hash AND sha2(d, 256) = d_hash

and thus further equivalent to a foldLeft that uses the DataFrame itself as the accumulator, chaining filter calls with ===.

So to summarize - if C is a set of columns, then:

  • ∃c∈C map(c) = sha2(c, 256) - atLeastOneMatch
  • ∀c∈C map(c) != sha2(c, 256) - !atLeastOneMatch
  • ∃c∈C map(c) != sha2(c, 256) - atLeastOneMismatch
  • ∀c∈C map(c) = sha2(c, 256) - !atLeastOneMismatch
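Checked against the sample data sketched at the top (assumed values), the four predicates give:

xtranwedf.filter(atLeastOneMatch).count      // 5: rows 1-4 plus row 6 (col_1 matches)
xtranwedf.filter(!atLeastOneMatch).count     // 1: row 5 only
xtranwedf.filter(atLeastOneMismatch).count   // 2: rows 5 and 6
xtranwedf.filter(!atLeastOneMismatch).count  // 4: rows 1-4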
