Count calls of UDF in Spark
Using Spark 1.6.1, I want to count the number of times a UDF is called. I want to do this because I have a very expensive UDF (~1 s per call), and I suspect the UDF is being called more often than there are records in my DataFrame, making my Spark job slower than necessary.

Although I could not reproduce that situation, I came up with a simple example showing that the number of calls to the UDF seems to differ (here: fewer) from the number of rows. How can that be?
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("Demo")
  val sc = new SparkContext(conf)
  sc.setLogLevel("WARN")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val callCounter = sc.accumulator(0)

  val df = sc.parallelize(1 to 10000, numSlices = 100).toDF("value")
  println(df.count) // gives 10000

  val myudf = udf((d: Int) => { callCounter.add(1); d })
  val res = df.withColumn("result", myudf($"value")).cache

  println(res.select($"result").collect().size) // gives 10000
  println(callCounter.value) // gives 9941
}
If using an accumulator is not the right way to count the calls of the UDF, how else could I do it?

Note: In my actual Spark job, I get a call count that is about 1.7 times higher than the actual number of records.
Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // [...]
  }
}
This should solve your problem.
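Putting the two together, the question's example restructured with an explicit main() might look roughly like this (a sketch, not tested against Spark 1.6.1; the point is that the accumulator is now created inside main, so it is fully initialized before the closures that capture it are serialized, whereas `extends App` defers field initialization via DelayedInit):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("Demo")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Created inside main: initialized before any UDF closure captures it.
    val callCounter = sc.accumulator(0)

    val df = sc.parallelize(1 to 10000, numSlices = 100).toDF("value")
    val myudf = udf((d: Int) => { callCounter.add(1); d })
    val res = df.withColumn("result", myudf($"value")).cache

    println(res.select($"result").collect().size) // 10000
    println(callCounter.value) // should now match the row count
  }
}
```

One caveat worth keeping in mind: accumulator updates performed inside transformations (as opposed to actions) come with no exactly-once guarantee, so task retries or re-executed stages can still inflate the count. An accumulator is therefore a rough diagnostic for "is my UDF being called far more often than expected?", not an exact call counter.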