
What is the best way to write an optimized UDF in a Spark Streaming application with Scala?

I am working on a Spark Streaming application where I need to consume data from one Kafka topic and push it into another Kafka topic.

I have created a UDF that performs some business logic that is not available in the built-in Spark SQL or other functions:

object TestingObject extends Serializable {

  // business logic that is not available in the built-in Spark SQL / other functions
  def userdefined_function(row_string: String): String = {
    "Data After Business Logic"
  }

  def main(args: Array[String]): Unit = {
    kafkaStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // newRDD: RDD[String] with the message values, built from rdd earlier in the job
        val df = ss.read.option("mode", "DROPMALFORMED").json(ss.createDataset(newRDD)(Encoders.STRING))
        df.toJSON.foreach { row =>
          val data = userdefined_function(row)
          kafkaproducer.send(topicname, data)
        }
      }
    }
  }
}

I know that using a UDF in a Spark application is very costly, but my business logic leaves me no other option, so I have to use one in my application.

My question is: how do I optimize my UDF in a Spark Scala streaming application?

Can I use the UDF inside the main function? Or can I use the UDF in the foreach function (on each row)? Or should I put the UDF in a different class and broadcast that class with Spark? Or what should I do? Can anyone give a suggestion for this? Thanks in advance.

Welcome to StackOverflow. I will try to clarify some points:

There are several main concepts about Spark that you should know for your code:

  1. The main function, like in other languages, is the entry point of your application. So, to your question of whether you can use a UDF inside the main function: yes, you can use whatever you want there.

  2. The concept of a UDF belongs to the Spark SQL world. That means it is closely tied to Spark DataFrames.

  3. You are using the old Spark Streaming implementation. Normally, you should use the Spark Structured Streaming API (see the first sketch after this list). The Spark Streaming API that you are using is built on top of the RDD API: for each mini-batch you can manipulate the incoming messages as RDDs, and there is no UDF there; you apply plain Scala functions to every mini-batch.

  4. Don't create a new dataframe for each mini-batch. You don't need to do that: your data is already distributed across the executors, and you can apply whatever you want with plain Scala code inside foreachRDD, using the RDD's map function for example (see the second sketch after this list). Imagine if you have thousands of mini-batches...

  5. Related to UDFs: they are very useful, but you have to consider that they are black boxes for the Spark optimizer. You can put whatever you want inside them, so Spark cannot inspect your code to build the execution plan in the way it considers most efficient.

  6. When you use a UDF, Spark has to serialize/deserialize the data between its internal representation and Scala types and vice versa (power comes at a cost), plus there is additional GC overhead. That does not mean you have to avoid them; sometimes they are very useful. But avoid using heavy objects inside them: for example, use plain arrays instead of tuples or big case classes.
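For point 3 (and points 5 and 6), a minimal sketch of a Structured Streaming job that reads from one Kafka topic, applies your business logic as a plain String-to-String UDF and writes to another topic could look like this. The SparkSession ss and userdefined_function come from your question; the broker address, topic names and checkpoint location are placeholders you would replace with your own values:

import org.apache.spark.sql.functions.{col, udf}

// wrap the business logic in a UDF that only uses plain String input/output
val enrichUdf = udf((row_string: String) => userdefined_function(row_string))

val input = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "input_topic")                 // placeholder
  .load()

val enriched = input
  .selectExpr("CAST(value AS STRING) AS value")       // the Kafka value column comes as binary
  .withColumn("value", enrichUdf(col("value")))

enriched.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("topic", "output_topic")                    // placeholder
  .option("checkpointLocation", "/tmp/checkpoint")    // placeholder
  .start()
  .awaitTermination()

Here Spark manages the Kafka sink and the micro-batching for you; the only black box for the optimizer is the small String-to-String UDF.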
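And if you stay on the DStream API, a minimal sketch of point 4, with no per-mini-batch DataFrame and no UDF at all, could be the following. It assumes kafkaStream is a direct stream of ConsumerRecord[String, String] and that userdefined_function and topicname are the ones from your question; the producer settings are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd
      .map(record => userdefined_function(record.value()))   // plain Scala function on each message
      .foreachPartition { partition =>
        // one producer per partition, created on the executor side
        val props = new Properties()
        props.put("bootstrap.servers", "broker:9092")         // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach { data =>
          producer.send(new ProducerRecord[String, String](topicname, data))
        }
        producer.close()
      }
  }
}

Creating (or pooling) the producer inside foreachPartition keeps it on the executors and avoids trying to serialize it with the closure.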
