How to handle null values in a Spark reduceByKey function?

Question

I have a Spark DataFrame (df) that looks like this:

+----------+--------+----------+--------+                                                                 
|        c1|      c2|        c3|      c4|
+----------+--------+----------+--------+
|      1   |    5   |      null|    7   |
+----------+--------+----------+--------+
|      1   |    5   |      4   |    8   |
+----------+--------+----------+--------+
|      1   |    3   |      null|   11   |
+----------+--------+----------+--------+
|      1   |    3   |      null| null   |
+----------+--------+----------+--------+
|      2   |    6   |      23  |   17   |
+----------+--------+----------+--------+
|      2   |    6   |      7   |    3   |
+----------+--------+----------+--------+
|      2   |    3   |      null|   11   |
+----------+--------+----------+--------+
|      2   |    3   |      null|   17   |
+----------+--------+----------+--------+

I want to aggregate using (c1, c2) as the key and get the average of c3 and c4, so that I end up with this:

+----------+--------+----------+--------+                                                                 
|        c1|      c2|        c3|      c4|
+----------+--------+----------+--------+
|      1   |    5   |      4   |  7.5   |
+----------+--------+----------+--------+
|      1   |    3   |      null|   11   |
+----------+--------+----------+--------+
|      2   |    6   |      15  |    10  |
+----------+--------+----------+--------+
|      2   |    3   |      null|   14   |
+----------+--------+----------+--------+

So, essentially, I am ignoring the null values.

My half-baked code looks something like this:

val df1 = df.
          // just working on c3 for time being
          map(x => ((x.getInt(0), x.getInt(1)), x.getDouble(3))).
          reduceByKey( 
            (x, y) => {
            var temp = 0
            var sum = 0.0
            var flag = false
            if (x == null) {
              if (y != null) {
                temp = temp + 1
                sum = y
                flag = true
              }
            } else {
              if (y == null) {
                temp = temp + 1
                sum = x 
              } else {
                temp = temp + 1
                sum = x + y
                flag = true
              } 
            } 
            if (flag == false) {
              null 
            } else {
              sum/temp 
            }
            }
          )

Obviously, the code above does not work. Any help getting it to work would be much appreciated.

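Edit 1: The answer given by @zero232 is a solution. However, it is not "the solution" I am looking for. My interest is in understanding how to handle null values when writing a custom function for reduceByKey(). I am re-asking the question below: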

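I want to aggregate using (c1, c2) as the key and get the root mean square [{sum(a_i^2)}^0.5] of c3 and c4 (or some other function that is not available in Spark) while ignoring the null values, so that I end up with this: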

+----------+--------+----------+--------+                                                                 
|        c1|      c2|        c3|      c4|
+----------+--------+----------+--------+
|      1   |    5   |      4   | 10.63  |
+----------+--------+----------+--------+
|      1   |    3   |      null|   11   |
+----------+--------+----------+--------+
|      2   |    6   |   24.04  |  17.26 |
+----------+--------+----------+--------+
|      2   |    3   |      null| 20.24  |
+----------+--------+----------+--------+

Answer 1

Just groupBy and use mean:

df.groupBy("c1", "c2").mean("c3", "c4")

Or with agg:

df.groupBy("c1", "c2").agg(avg("c3"), avg("c4"))

In general, all the basic functions on DataFrames handle null values correctly.

import org.apache.spark.sql.functions._

def rms(c: String) = sqrt(avg(pow(col(c), 2))).alias(s"rms($c)")
df.groupBy("c1", "c2").agg(rms("c3"), rms("c4"))
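Note that rms above divides by the number of non-null values, whereas the expected table in the edited question matches the bracketed formula {sum(a_i^2)}^0.5 with no division (for example, sqrt(7^2 + 8^2) ≈ 10.63). A minimal sketch of that variant follows, written to be pasted into spark-shell. The helper name rootSumSquares, the local session settings, and the inline copy of the sample data are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, pow, sqrt, sum}

// Assumption: a local session for experimentation (spark-shell already provides one).
val spark = SparkSession.builder().master("local[*]").appName("root-sum-squares").getOrCreate()
import spark.implicits._

// Sample rows from the question; None is stored as null.
val df = Seq[(Int, Int, Option[Double], Option[Double])](
  (1, 5, None, Some(7.0)),
  (1, 5, Some(4.0), Some(8.0)),
  (1, 3, None, Some(11.0)),
  (1, 3, None, None),
  (2, 6, Some(23.0), Some(17.0)),
  (2, 6, Some(7.0), Some(3.0)),
  (2, 3, None, Some(11.0)),
  (2, 3, None, Some(17.0))
).toDF("c1", "c2", "c3", "c4")

// Hypothetical helper: square root of the sum of squares.
// sum skips nulls and yields null when every value in the group is null.
def rootSumSquares(c: String): Column = sqrt(sum(pow(col(c), 2))).alias(c)

val result = df.groupBy("c1", "c2").agg(rootSumSquares("c3"), rootSumSquares("c4"))
result.show()
```

Groups whose values are all null, such as c3 for (1, 3), come back as null instead of failing.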

If you want to ignore nulls with RDDs, simply filter them out before applying the reduction:

somePairRDD.filter(_._2 != null)
  .foldByKey(someDefaultValue)(someReducingFunction)
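As a concrete sketch of the filter-then-fold approach, the following computes the square root of the sum of squares of c3 per (c1, c2) key. It assumes the values are boxed as java.lang.Double so that they can actually be null (a Scala Double never is). The helper boxed, the variable names, and the local session are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("filter-nulls").getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper that boxes a value so the RDD can also hold null.
def boxed(d: Double): java.lang.Double = java.lang.Double.valueOf(d)

// (c1, c2) -> c3, copied from the table in the question.
val pairs = sc.parallelize(Seq[((Int, Int), java.lang.Double)](
  ((1, 5), null), ((1, 5), boxed(4.0)),
  ((1, 3), null), ((1, 3), null),
  ((2, 6), boxed(23.0)), ((2, 6), boxed(7.0)),
  ((2, 3), null), ((2, 3), null)
))

val rootSumSquares = pairs
  .filter(_._2 != null)                            // drop null values up front
  .mapValues(v => v.doubleValue * v.doubleValue)   // square each remaining value
  .foldByKey(0.0)(_ + _)                           // sum of squares per key
  .mapValues(s => math.sqrt(s))

val result = rootSumSquares.collectAsMap()
```

One consequence of filtering first: keys whose values are all null, here (1, 3) and (2, 3), disappear from the result entirely instead of appearing with a null.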

Alternatively, convert the values to Option and use pattern matching:

somePairRDD.mapValues(Option(_)).reduceByKey {
  case (Some(x), Some(y)) => doSomething(x, y)
  case (Some(x), _) => doSomething(x)
  case (_, Some(y)) => doSomething(y)
  case _ => someDefault
}
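To make the Option pattern concrete, here is a sketch that again computes the square root of the sum of squares of c3, this time keeping every key. Because reduceByKey needs a function of type (V, V) => V, the reducer here takes and returns Option[Double]. The boxed helper, the sample data layout, and the local session are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("option-reduce").getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper that boxes a value so the RDD can also hold null.
def boxed(d: Double): java.lang.Double = java.lang.Double.valueOf(d)

// (c1, c2) -> c3, copied from the table in the question.
val pairs = sc.parallelize(Seq[((Int, Int), java.lang.Double)](
  ((1, 5), null), ((1, 5), boxed(4.0)),
  ((1, 3), null), ((1, 3), null),
  ((2, 6), boxed(23.0)), ((2, 6), boxed(7.0)),
  ((2, 3), null), ((2, 3), null)
))

// Option(null) is None, so nulls become None and real values become Some(square).
val squared = pairs.mapValues(v => Option(v).map(d => d.doubleValue * d.doubleValue))

// The reducer has type (Option[Double], Option[Double]) => Option[Double].
val summed = squared.reduceByKey {
  case (Some(x), Some(y)) => Some(x + y)
  case (Some(x), None)    => Some(x)
  case (None, Some(y))    => Some(y)
  case (None, None)       => None
}

val result = summed.mapValues(_.map(s => math.sqrt(s))).collectAsMap()
```

Unlike the filter approach, keys with only null values stay in the result with the value None.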

You can also use map / flatMap / getOrElse and other standard tools to handle undefined values.
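For the average in the original question, one more point matters: a mean of means is generally not the overall mean, so a reducer that returns a running average gives wrong results once more than two values share a key. A common workaround is to reduce (sum, count) pairs and divide at the end. The sketch below does this for c3, using getOrElse to turn nulls into a neutral (0.0, 0) pair. The boxed helper and all variable names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("null-safe-average").getOrCreate()
val sc = spark.sparkContext

// Hypothetical helper that boxes a value so the RDD can also hold null.
def boxed(d: Double): java.lang.Double = java.lang.Double.valueOf(d)

// (c1, c2) -> c3, copied from the table in the question.
val pairs = sc.parallelize(Seq[((Int, Int), java.lang.Double)](
  ((1, 5), null), ((1, 5), boxed(4.0)),
  ((1, 3), null), ((1, 3), null),
  ((2, 6), boxed(23.0)), ((2, 6), boxed(7.0)),
  ((2, 3), null), ((2, 3), null)
))

// Each value becomes (sum, count); a null contributes nothing to either part.
val sumCount = pairs
  .mapValues(v => Option(v).map(d => (d.doubleValue, 1L)).getOrElse((0.0, 0L)))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }

// Divide only at the end; a count of zero means every value was null.
val averages = sumCount.mapValues { case (s, n) => if (n == 0) None else Some(s / n) }

val result = averages.collectAsMap()
```

With the sample data this gives an average for (1, 5) and (2, 6) and None for the two keys whose c3 values are all null, which matches the c3 column of the expected table in the question.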


Solution 1
3 Accepted 2016-03-30 05:09:15
