
Pass accumulators to a Spark UDF

This is a simplified version of what I am trying to do. I want to do some counting inside my UDF. One way I thought of doing it is to pass Long accumulators to the UDF and increment them inside the if/else branches of the deserializeProtobuf function, but I'm not able to get the syntax working. Can anyone help me with that? Is there a better way?

import java.io.ByteArrayInputStream

def deserializeProtobuf(raw_data: Array[Byte]) = {

    val input_stream = new ByteArrayInputStream(raw_data)
    val parsed_data = CustomClass.parseFrom(input_stream)

    if (condition 1 related to parsed_data) {
        // increment variable1
    }
    else if (condition 2 related to parsed_data) {
        // increment variable2
    }
    else {
        // increment variable3
    }

}



val decode = udf(deserializeProtobuf _)
      
val deserialized_data = ds.withColumn("data", decode(col("protobufData")))

I have done something like this before. If you are doing heavy lifting in your CustomClass, one thing I can suggest is to broadcast it; you can also instantiate the metrics on the broadcast variable.
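A minimal sketch of that broadcast pattern, assuming a hypothetical `ProtobufDeserializer` wrapper around whatever heavy setup `CustomClass.parseFrom` needs (the class name and its `parse` method are placeholders, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

// Hypothetical heavy object; build it once on the driver.
class ProtobufDeserializer extends Serializable {
  // expensive setup (descriptors, registries, ...) would go here
  def parse(raw: Array[Byte]): String = {
    // placeholder for CustomClass.parseFrom(...) mapped to a Spark-storable type
    new String(raw)
  }
}

val spark = SparkSession.builder().appName("broadcast-udf-sketch").getOrCreate()

// Ship the heavy object to each executor once instead of once per task closure.
val deserializerBc = spark.sparkContext.broadcast(new ProtobufDeserializer)

// The UDF only captures the lightweight broadcast handle.
val decode = udf((raw: Array[Byte]) => deserializerBc.value.parse(raw))

// usage: ds.withColumn("data", decode(col("protobufData")))
```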

Now coming to the counting part: I tried accumulators, but it was quite difficult to manage them inside a UDF and get a correct count over a window, so I used spark-metrics instead and send the counts at a regular interval.

Use this: https://github.com/groupon/spark-metrics

Make sure you initialise the metrics at broadcast variable creation time; from that point on, the copied variable on each executor will report to the same metrics.
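For reference, a rough sketch of how that might look with the groupon/spark-metrics library. The `UserMetricsSystem.initialize` and `UserMetricsSystem.counter` calls below follow my reading of that project's README and should be verified against the version you actually use; the sink itself still has to be configured through Spark's usual metrics.properties mechanism.

```scala
// Sketch only: the UserMetricsSystem API is taken from the groupon/spark-metrics
// README and may differ between versions.
import org.apache.spark.groupon.metrics.UserMetricsSystem
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().getOrCreate()

// Initialise once on the driver before any job that reports metrics runs.
UserMetricsSystem.initialize(spark.sparkContext, "protobuf_decoder")

val decode = udf((raw: Array[Byte]) => {
  // Counters are resolved lazily on each executor but report into the same
  // named metric, so the configured sink sees one aggregated series.
  UserMetricsSystem.counter("records_parsed").inc(1L)
  // ... actual deserialisation of `raw` would go here ...
  raw.length
})
```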

You shouldn't have to pass the accumulator to the UDF:

import org.apache.spark.util.{AccumulatorV2, LongAccumulator}
import org.apache.spark.sql.functions.{udf,col}

var acc1: LongAccumulator = null
def my_udf = udf((arg1: String) => {
     ...
     acc1.add(1)
})
val spark = SparkSession...
acc1      = spark.sparkContext.longAccumulator("acc1")
... withColumn("col_name", my_udf(col("...")))
// some action here to cause the withColumn to execute
System.err.println(s"${acc1.value}")
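Applied to the question's deserializeProtobuf case, a self-contained sketch of that approach might look like the following; the branch conditions are placeholders for real checks on the parsed protobuf, since `CustomClass` isn't shown here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

object AccumulatorUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("accumulator-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Register the accumulators on the driver; executors receive copies whose
    // increments are merged back on the driver after each action.
    val condition1Count = spark.sparkContext.longAccumulator("condition1")
    val condition2Count = spark.sparkContext.longAccumulator("condition2")
    val otherCount      = spark.sparkContext.longAccumulator("other")

    val decode = udf((raw: Array[Byte]) => {
      // Placeholder branching; in the real code this would inspect parsed_data.
      if (raw.length > 10) condition1Count.add(1)
      else if (raw.length > 5) condition2Count.add(1)
      else otherCount.add(1)
      raw.length
    })

    val ds = Seq(Array.fill[Byte](12)(1), Array.fill[Byte](7)(1), Array.fill[Byte](2)(1))
      .toDF("protobufData")

    // An action must run before the accumulator values mean anything.
    ds.withColumn("data", decode(col("protobufData"))).count()

    println(s"condition1=${condition1Count.value}, " +
      s"condition2=${condition2Count.value}, other=${otherCount.value}")

    spark.stop()
  }
}
```

One caveat: Spark only guarantees exactly-once accumulator updates for updates made inside actions; updates made in transformations (which is where a UDF runs) can be counted more than once if tasks are retried or stages are recomputed.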
