
How to pass Set/HashSet as argument into UDF in spark?

I'm trying to write a filter function that checks whether a value is in a Set. I implemented it as a UDF, but it seems a UDF cannot take a Set/HashSet as an argument.

The set is built from:

val testSet = existTableDF.select("Column1")
  .rdd.map(r => r(0).asInstanceOf[String]).collect().toSet

The UDF:

def checkExistPlan(col1: String, testSet: Set[String]): Boolean = {
  !testSet.contains(col1)
}
val existFilter = udf((x: String, testSet: Set[String]) => checkExistPlan(x, testSet))

Code using the UDF:

testDF.filter(existFilter('Column1,lit(existMemberHashSet)))

When executing, the following error is thrown:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class java.util.HashSet [Some value here]

First of all, you probably want a broadcast set for the purpose of filtering. If you don't broadcast it, the set will be serialized and shipped to the executors with every task (see the Spark documentation on broadcast variables).

The problem in your code is that you are creating a literal from a non-primitive type; `lit` only supports the literal types Spark can embed directly in the query plan, and a `java.util.HashSet` is not one of them. Try something like this instead:

val s: Set[String] = Set("1", "3")
val broadcastedSet = spark.sparkContext.broadcast(s)

def checkExistPlan(col1: String): Boolean = {
  broadcastedSet.value.contains(col1)
}

val existFilter = udf((x: String) => checkExistPlan(x))
someDF.filter(existFilter($"number")).show()
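When the only logic is a membership test, you may not need a UDF at all: Spark's built-in `Column.isin` expresses the same check and stays visible to the optimizer. A minimal sketch, assuming a DataFrame `someDF` with a string column `number` (names taken from the answer above; negate the condition with `!` if, as in the question, you want to keep rows *not* in the set):

```scala
import org.apache.spark.sql.functions.col

val s: Set[String] = Set("1", "3")

// Keep rows whose `number` is NOT in the set,
// mirroring the intent of the original checkExistPlan filter.
val filtered = someDF.filter(!col("number").isin(s.toSeq: _*))
filtered.show()
```

For very large sets, the broadcast-variable UDF above is still the safer choice, since `isin` inlines every element into the query plan.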
