
How to pass Set/HashSet as argument into UDF in spark?

I'm trying to write a filter function that checks whether a value is in a Set. I implemented it as a UDF, but it seems a UDF cannot take a Set/HashSet as an argument.

The set is built from:

val testSet = existTableDF.select("Column1")
  .rdd.map(r => r(0).asInstanceOf[String]).collect().toSet

The UDF:

def checkExistPlan(col1: String, testSet: Set[String]): Boolean = {
  !testSet.contains(col1)
}
val existFilter = udf((x: String, testSet: Set[String]) => checkExistPlan(x, testSet))

Code using the UDF:

testDF.filter(existFilter('Column1,lit(existMemberHashSet)))

When executing, the following error is thrown:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class java.util.HashSet [Some value here]

First of all, you probably want a broadcast set for the purpose of filtering. If you don't broadcast it, the set will be serialized and shipped to the executors with every task (see the Spark documentation on broadcast variables).

The problem in your code is that you are creating a literal from a non-primitive type; `lit` only supports the literal types Spark can embed directly in the query plan, and a `java.util.HashSet` is not one of them. Try something like this instead:

val s: Set[String] = Set("1", "3")
val broadcastedSet = spark.sparkContext.broadcast(s)

def checkExistPlan(col1: String): Boolean = {
  broadcastedSet.value.contains(col1)
}

val existFilter = udf((x: String) => checkExistPlan(x))
someDF.filter(existFilter($"number")).show()
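When the only logic is a membership test, you may not need a UDF at all: Spark's built-in `Column.isin` expresses the same check and stays visible to the optimizer. A minimal sketch, assuming a DataFrame `someDF` with a string column `number` (names taken from the answer above; negate the condition with `!` if, as in the question, you want to keep rows *not* in the set):

```scala
import org.apache.spark.sql.functions.col

val s: Set[String] = Set("1", "3")

// Keep rows whose `number` is NOT in the set,
// mirroring the intent of the original checkExistPlan filter.
val filtered = someDF.filter(!col("number").isin(s.toSeq: _*))
filtered.show()
```

For very large sets, the broadcast-variable UDF above is still the safer choice, since `isin` inlines every element into the query plan.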
