Spark UDF statistics

Question

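What is the best way to count the successful invocations of a UDF? I have a UDF (Java) that converts the value of each field in a column and, if four conditions are met, assigns the result to a new column. Otherwise the value is null. It would be nice to know which condition was not met, but for now knowing the overall ratio of successful to unsuccessful calls is enough. Because this DF is huge, logging every call is not feasible, so I am thinking of creating a counter or cache as part of the UDF and, once the job has finished, logging the results or writing them to a database. The job runs every few hours on multiple workers, so this would not be expensive.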

Answer 1

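You can use accumulators to collect the statistics.

The following example uses two counters, which are incremented whenever complexUdf returns an odd or an even number, respectively.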

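import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.util.LongAccumulator;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

SparkSession spark = ...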

//create the two accumulators
LongAccumulator oddAccumulator = spark.sparkContext().longAccumulator("oddAccumulator");
LongAccumulator evenAccumulator = spark.sparkContext().longAccumulator("evenAccumulator");

//create the udf
//note: within the udf do not try to read the current values of the accumulators
spark.udf().register("complexUdf", (UDF1<Long, Long>)l -> {
  long result = l + 1;
  if( result%2==0) {
    evenAccumulator.add(1);
  }
  else {
    oddAccumulator.add(1);
  }
  return result;
}, DataTypes.LongType);

//create a toy dataset
Dataset<Long> df = spark.createDataset(Arrays.asList(1L, 2L, 3L), Encoders.LONG());

//call the udf
df.withColumn("processedData",callUDF("complexUdf", col("value")) ).show();

//in the driver process, the value method of each accumulator returns the expected value
System.out.println(String.format("the udf returned %d odd and %d even numbers", 
  oddAccumulator.value(), 
  evenAccumulator.value()));

Output:

+-----+-------------+
|value|processedData|
+-----+-------------+
|    1|            2|
|    2|            3|
|    3|            4|
+-----+-------------+

the udf returned 1 odd and 2 even numbers
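
The question also asks which of the four conditions failed. The following is a minimal, Spark-free sketch of that bookkeeping, not a definitive implementation: it keeps one counter per condition plus one for successful calls, and derives the success ratio from them. The class name UdfStats, the method names and the four conditions are all hypothetical, since the question does not say what the real conditions are. In a Spark job each counter slot would be its own LongAccumulator, created on the driver as in the answer above. Note that Spark only guarantees exactly-once accumulator updates for updates made inside actions; updates made inside a UDF can be applied more than once if a task or stage is re-executed, so the counts are best treated as approximate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class UdfStats {
    // Hypothetical conditions, checked in order; replace with the real four.
    static final List<Predicate<String>> CONDITIONS = Arrays.asList(
        s -> s != null,
        s -> !s.isEmpty(),
        s -> s.length() <= 10,
        s -> s.chars().allMatch(Character::isDigit)
    );

    // counts[i] = calls rejected by condition i; the last slot counts successes.
    // In Spark, each slot would be a separate LongAccumulator.
    final long[] counts = new long[CONDITIONS.size() + 1];

    // Mimics the UDF: returns the converted value, or null if a condition fails.
    Long convert(String input) {
        for (int i = 0; i < CONDITIONS.size(); i++) {
            if (!CONDITIONS.get(i).test(input)) {
                counts[i]++;          // record the first condition that failed
                return null;
            }
        }
        counts[CONDITIONS.size()]++;  // all conditions met
        return Long.parseLong(input);
    }

    // Share of successful calls among all calls seen so far.
    double successRatio() {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts[CONDITIONS.size()] / total;
    }

    public static void main(String[] args) {
        UdfStats stats = new UdfStats();
        for (String s : new String[] {"42", "", null, "abc", "7"}) {
            stats.convert(s);
        }
        System.out.println("success ratio: " + stats.successRatio());
        System.out.println("counters: " + Arrays.toString(stats.counts));
    }
}
```

Recording only the first failing condition keeps the counters mutually exclusive, so they sum to the total number of calls. If overlapping failures matter, each condition could instead be evaluated and counted independently.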
