I need to use a tester in a Scala Spark filter. The tester implements Java's Predicate interface, and its concrete class name is passed in as an argument. I'm doing something like this:
val tester = Class.forName(qualifiedName).newInstance().asInstanceOf[Predicate[T]]
var filtered = rdd.filter(elem => tester.test(elem))
The problem is that at runtime I get a Spark "Task not serializable" exception, because my specific Predicate class is not Serializable.
If I do
val tester = Class.forName(qualifiedName).newInstance()
  .asInstanceOf[Predicate[T] with Serializable]
var filtered = rdd.filter(elem => tester.test(elem))
I get the same error. If I create the tester inside the rdd.filter call, it works:
var filtered = rdd.filter { elem =>
  val tester = Class.forName(qualifiedName).newInstance()
    .asInstanceOf[Predicate[T] with Serializable]
  tester.test(elem)
}
But I would like to create a single object (maybe to broadcast) for testing. How can I resolve this?
You simply have to require that the class implements Serializable. Note that the asInstanceOf[Predicate[T] with Serializable] cast is a lie: it doesn't actually check that the value is Serializable, which is why the second case doesn't produce an error immediately during the cast, and the last one "succeeds".
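Since the cast itself verifies nothing about Serializable, one way to make the requirement real is to check the loaded class explicitly before instantiating it. The following is only a minimal sketch: `IsPositive`, `IsEven`, `loadTester` and `canSerialize` are hypothetical names introduced for illustration, and plain Java serialization stands in for what Spark would do when shipping the closure.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.util.function.Predicate

// Hypothetical predicate that does NOT implement Serializable.
class IsPositive extends Predicate[Integer] {
  override def test(x: Integer): Boolean = x.intValue > 0
}

// Hypothetical predicate that does implement Serializable.
class IsEven extends Predicate[Integer] with Serializable {
  override def test(x: Integer): Boolean = x.intValue % 2 == 0
}

object CastDemo {
  // Fails fast with IllegalArgumentException if the class is not a
  // Serializable Predicate, instead of relying on the unchecked cast.
  def loadTester[T](qualifiedName: String): Predicate[T] with Serializable = {
    val cls = Class.forName(qualifiedName)
    require(classOf[Predicate[_]].isAssignableFrom(cls),
      s"$qualifiedName does not implement Predicate")
    require(classOf[java.io.Serializable].isAssignableFrom(cls),
      s"$qualifiedName does not implement Serializable")
    cls.getDeclaredConstructor().newInstance()
      .asInstanceOf[Predicate[T] with Serializable]
  }

  // Returns true if Java serialization accepts the object.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

With this check, a non-serializable class name is rejected on the driver when the tester is loaded, rather than later when the task is serialized.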
"But I would like to create a single object (maybe to broadcast) for testing."
You can't. Broadcast or not, deserialization will create new objects on the worker nodes. But you can create just one instance per partition:
val filtered = rdd.mapPartitions { iter =>
  val tester = Class.forName(qualifiedName).newInstance()
    .asInstanceOf[Predicate[T]]
  iter.filter(tester.test)
}
It will actually perform better than serializing the tester, sending it, and deserializing it would, since it's strictly less work.
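If a single handle that can be captured in a closure (or broadcast) is still wanted, a different technique from the one above is a small Serializable wrapper that ships only the class name and instantiates the real predicate lazily after deserialization. This is a sketch under assumptions, not part of the original answer: `LazyTester`, `StartsWithA` and `roundTrip` are hypothetical names, and a Java serialization round trip stands in for Spark sending the object to a worker.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.function.Predicate

// Hypothetical predicate; deliberately not Serializable.
class StartsWithA extends Predicate[String] {
  override def test(s: String): Boolean = s.startsWith("a")
}

// Serializable wrapper: only the class name is serialized. The delegate
// is transient and is re-created on first use in each deserialized copy.
class LazyTester[T](qualifiedName: String) extends Serializable {
  @transient private lazy val delegate: Predicate[T] =
    Class.forName(qualifiedName).getDeclaredConstructor().newInstance()
      .asInstanceOf[Predicate[T]]

  def test(elem: T): Boolean = delegate.test(elem)
}

object LazyTesterDemo {
  // Serializes and deserializes an object in memory, standing in for
  // the driver-to-worker transfer.
  def roundTrip[A <: AnyRef](obj: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

In a Spark job this would presumably be used as rdd.filter(elem => tester.test(elem)) with tester being a LazyTester. As the answer says, each deserialized copy is still a new object, so the predicate class is instantiated once per copy rather than once overall.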