How do Spark executors in a cluster use variables declared in Scala?

Question

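I am learning Scala and Spark. I have a requirement, and I have some doubts about the approach I implemented to meet it.
First, I will show what my DataFrame looks like and what I want to do with it.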

What it looks like

+-----------------+----------------------------+
|           Street|Total Passing Vehicle Volume|
+-----------------+----------------------------+
|      Kimball Ave|                         100|
|      Ashland Ave|                          50|
|         State St|                         110|
|      Kimball Ave|                          40|
|     Diversey Ave|                          60|
|      Ashland Ave|                          70|
+-----------------+----------------------------+

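As you can see, some street names are repeated. The requirement is to sum the passing vehicle volume for each street and add a new column containing that sum.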

It should look like this

+-----------------+----------------------------+-------------+
|           Street|Total Passing Vehicle Volume|Total Vehicle|
+-----------------+----------------------------+-------------+
|      Kimball Ave|                         100|          140|
|      Ashland Ave|                          50|          120|
|         State St|                         110|          110|
|      Kimball Ave|                          40|          140|
|     Diversey Ave|                          60|           60|
|      Ashland Ave|                          70|          120|
+-----------------+----------------------------+-------------+

I got the result I wanted, but after reading some articles I found that my approach is not good, because it will fail in some situations.

My approach

var map: Map[String,Integer]= Map.empty;

  trafficDf.select(
    trafficDf.col("Street").cast("string"),
    trafficDf.col("Total Passing Vehicle Volume").cast("integer")
  ).foreach(r=>{
    if(!map.contains(r.get(0).toString)){
      map += r.get(0).toString -> r.get(1).asInstanceOf[Integer]
    }else{
      var m= map(r.get(0).toString);
      map += r.get(0).toString -> (m + r.get(1).asInstanceOf[Integer])
    }
  })

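As you can see, I declared a Map and iterate over the Street and Total Passing Vehicle Volume columns with foreach. For each record I check whether the street already exists in the map: if it does, I update its value by adding the current value to the previous one; otherwise I simply insert the street with its value.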

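But after reading some articles, I think this will fail when deployed to a cluster, because the execution is distributed across multiple executors and the executors will not have the Map instance with them, so the final map will not even be populated.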

Then I read about closures, which use a free variable that is not part of the function itself. But the Map I declared is also a free variable (I think).
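As a rough illustration of why the map can stay empty: Spark serializes the function passed to foreach together with the variables it captures, so code running on an executor updates a deserialized copy rather than the driver's variable. The sketch below imitates that copy with plain Java serialization and no Spark at all. `Totals`, `shipCopy`, and `addRow` are hypothetical names introduced only for this illustration, and real task shipping involves more than this.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopySketch {
  // Hypothetical stand-in for the driver-side `var map`: mutable state a task would capture.
  final class Totals(var byStreet: Map[String, Int]) extends Serializable

  // Round-trips a value through Java serialization, imitating how captured
  // state is copied when a task is shipped to another JVM.
  def shipCopy[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // The same "add to the running total" update the question performs inside foreach.
  def addRow(totals: Totals, street: String, volume: Int): Unit =
    totals.byStreet = totals.byStreet.updated(street, totals.byStreet.getOrElse(street, 0) + volume)

  def main(args: Array[String]): Unit = {
    val driverSide = new Totals(Map.empty)
    val executorSide = shipCopy(driverSide) // the simulated executor gets its own copy
    addRow(executorSide, "Kimball Ave", 100)
    addRow(executorSide, "Kimball Ave", 40)
    println(executorSide.byStreet) // prints Map(Kimball Ave -> 140)
    println(driverSide.byStreet)   // prints Map()
  }
}
```

In local mode the function may run in the same JVM as the driver, which can make the foreach version appear to work; that behaviour should not be relied on in a cluster.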

Here is where I add the column with the values:

var func = udf( (s:String) => {
    val d = map.get(s)     // look up the total for this record's Street value
    d
  } )

val newTrafficFd= trafficDf.select($"Street",$"Total Passing Vehicle Volume",func($"Street").as("Total Vehicle"))
  newTrafficFd.show(20);

Any improvements or suggestions? Will it work as I expect?
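On the read side, a map that is fully built on the driver before the UDF is defined can be used on executors, because it travels with the function. One way to build it without mutating driver state from foreach is to aggregate, collect the small result, and broadcast it. The following is a minimal sketch under stated assumptions, not a recommended final design: `BroadcastLookupSketch`, `streetTotals`, and `withTotals` are hypothetical names, the volume column is assumed to be castable to long, and the number of distinct streets is assumed small enough to collect to the driver.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, udf}

object BroadcastLookupSketch {
  // Aggregates per street on the cluster, then collects the (small) result
  // to the driver as an ordinary immutable Map.
  def streetTotals(trafficDf: DataFrame): Map[String, Long] =
    trafficDf
      .groupBy("Street")
      .agg(sum(col("Total Passing Vehicle Volume").cast("long")).as("total"))
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

  // Adds "Total Vehicle" by looking each street up in a read-only broadcast copy of the map.
  def withTotals(spark: SparkSession, trafficDf: DataFrame): DataFrame = {
    val totalsBc = spark.sparkContext.broadcast(streetTotals(trafficDf))
    // Returning an Option yields a nullable column; a missing street becomes null.
    val totalFor = udf((street: String) => totalsBc.value.get(street))
    trafficDf.withColumn("Total Vehicle", totalFor(col("Street")))
  }
}
```

With a SparkSession named `spark`, `BroadcastLookupSketch.withTotals(spark, trafficDf).show()` would display the extra column. Executors only ever read the broadcast value here; nothing they do is written back to the driver.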

Answer 1

Stone, you don't need RDDs or UDFs for this. It can be done with a window aggregation like this:

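import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val trafficDf = Seq(
      ("Kimball Ave", 100),
      ("Ashland Ave", 50),
      ("State St", 110),
      ("Kimball Ave", 40),
      ("Diversey Ave", 60),
      ("Ashland Ave", 70)
    ).toDF("Street", "Total Passing Vehicle Volume")

// Sum the volume over all rows that share the same Street, keeping every row
trafficDf.withColumn("Total Vehicle", sum($"Total Passing Vehicle Volume").over(Window.partitionBy("Street")))
      .show()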

Output:

+------------+----------------------------+-------------+
|      Street|Total Passing Vehicle Volume|Total Vehicle|
+------------+----------------------------+-------------+
| Ashland Ave|                          50|          120|
| Ashland Ave|                          70|          120|
|Diversey Ave|                          60|           60|
| Kimball Ave|                         100|          140|
| Kimball Ave|                          40|          140|
|    State St|                         110|          110|
+------------+----------------------------+-------------+

Explanation:

Windowing (analytical) functions are an ANSI SQL feature that allows additional aggregates to be computed over groups of rows without collapsing those rows.

Spark implements this feature, so it can easily be used from its DSL.
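Because the feature comes from SQL, the same aggregation can also be written as a Spark SQL query instead of the DSL. This is only a sketch: `WindowSqlSketch`, `withTotalsSql`, and the view name `traffic` are hypothetical, and the DataFrame is assumed to belong to the SparkSession that is passed in.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WindowSqlSketch {
  // Registers the DataFrame as a temporary view and computes the per-street
  // total with a SQL window function, keeping one output row per input row.
  def withTotalsSql(spark: SparkSession, trafficDf: DataFrame): DataFrame = {
    trafficDf.createOrReplaceTempView("traffic") // hypothetical view name
    spark.sql(
      """SELECT Street,
        |       `Total Passing Vehicle Volume`,
        |       SUM(`Total Passing Vehicle Volume`) OVER (PARTITION BY Street) AS `Total Vehicle`
        |FROM traffic""".stripMargin)
  }
}
```

The backticks are needed because the column names contain spaces.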

This feature lets me compute the Total Vehicle column as the sum of Total Passing Vehicle Volume over all rows that share the same Street value.
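To make the semantics concrete, here is what that partitioned sum amounts to on an ordinary Scala collection, with no Spark involved. `WindowSumSketch` and `withStreetTotals` are hypothetical names used only for illustration; Spark's distributed execution works differently, but the result per row is the same idea.

```scala
object WindowSumSketch {
  // Input rows are (street, volume); output rows add the total for that street.
  def withStreetTotals(rows: Seq[(String, Int)]): Seq[(String, Int, Int)] = {
    // "PARTITION BY Street": group the rows by street and sum each group once.
    val totalByStreet: Map[String, Int] =
      rows.groupBy(_._1).map { case (street, group) => street -> group.map(_._2).sum }
    // Unlike a plain GROUP BY, every input row is kept and annotated with its group's total.
    rows.map { case (street, volume) => (street, volume, totalByStreet(street)) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("Kimball Ave", 100),
      ("Ashland Ave", 50),
      ("State St", 110),
      ("Kimball Ave", 40),
      ("Diversey Ave", 60),
      ("Ashland Ave", 70)
    )
    withStreetTotals(rows).foreach(println) // first line printed: (Kimball Ave,100,140)
  }
}
```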

You can learn more about window functions in general here:

https://www.vertabelo.com/blog/oracle-sql-analytical-functions-for-beginners-a-gentle-introduction-to-common-sql-window-functions/

Or specifically for Spark:

http://queirozf.com/entries/spark-dataframe-examples-window-functions

Answer 1 (2, accepted, 2020-01-02 10:28:57)
