Merging and aggregating DataFrames with Spark Scala

Question

After running some transformations with Spark Scala (1.6.2), I have a dataset that gives me the following two DataFrames.

DF1:

| date| country | count|
| 1872| Scotland|     1|
| 1873| England |     1|
| 1873| Scotland|     1|
| 1875| England |     1|
| 1875| Scotland|     2|

DF2:

| date| country | count|
| 1872| England |     1|
| 1873| Scotland|     1|
| 1874| England |     1|
| 1875| Scotland|     1|
| 1875| Wales   |     1|

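Now, from the two DataFrames above, I want to aggregate the counts by date and country, as in the output below. I tried using union and join but could not get the expected result.

Expected output for the two DataFrames above: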

| date| country | count|
| 1872| England |     1|
| 1872| Scotland|     1|
| 1873| Scotland|     2|
| 1873| England |     1|
| 1874| England |     1|
| 1875| Scotland|     3|
| 1875| Wales   |     1|
| 1875| England |     1|

Please help me solve this.

Answer 1

The best approach is to perform a union, then a groupBy on the two columns, and then use sum to specify the column to add up:

// Stack the rows of both DataFrames, then sum count per (date, country)
df1.unionAll(df2)
   .groupBy("date", "country")
   .sum("count")

Output:

+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland|         1|
|1875| England|         1|
|1873| England|         1|
|1875|   Wales|         1|
|1872| England|         1|
|1874| England|         1|
|1873|Scotland|         2|
|1875|Scotland|         3|
+----+--------+----------+
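The result column above is named sum(count), and the rows come back in no particular order. Here is a minimal sketch, assuming Spark 1.6 and a local master, of renaming that column and sorting so the result lines up with the expected table in the question. The object name MergeCounts, the helper mergeCounts, and the local[*] master are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MergeCounts {
  // Union the rows, sum count per (date, country), restore the column name, then sort.
  def mergeCounts(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.unionAll(df2)
      .groupBy("date", "country")
      .sum("count")
      .withColumnRenamed("sum(count)", "count")
      .orderBy("date", "country")

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local context; in spark-shell, sc and sqlContext already exist.
    val sc = new SparkContext(new SparkConf().setAppName("merge-counts").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // provides toDF on local Seqs

    // The two DataFrames from the question.
    val df1 = Seq((1872, "Scotland", 1), (1873, "England", 1), (1873, "Scotland", 1),
                  (1875, "England", 1), (1875, "Scotland", 2)).toDF("date", "country", "count")
    val df2 = Seq((1872, "England", 1), (1873, "Scotland", 1), (1874, "England", 1),
                  (1875, "Scotland", 1), (1875, "Wales", 1)).toDF("date", "country", "count")

    mergeCounts(df1, df2).show()
    sc.stop()
  }
}
```

Sorting is only for readability; skip the orderBy if the row order does not matter, since it adds a shuffle.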

Answer 2

With the DataFrame API, you can do this with unionAll followed by groupBy.

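import org.apache.spark.sql.functions.sum
import sqlContext.implicits._ // for $"count"; assumes a SQLContext named sqlContext, as in spark-shell

DF1.unionAll(DF2)
  .groupBy("date", "country")
  .agg(sum($"count").as("count"))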

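This first puts all rows from both DataFrames into a single DataFrame. Grouping by the date and country columns then gives the sum of the count column per date and country, as requested. The as("count") part renames the aggregated column to count.

Note: in newer Spark versions (2.0+), unionAll is deprecated and replaced by union.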
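To make the two steps concrete without a Spark installation, here is a minimal sketch of the same logic on plain Scala collections. It only models what unionAll followed by groupBy and sum computes; it is not Spark code. The Row case class and the UnionThenSum object are hypothetical names introduced for illustration.

```scala
// Hypothetical row type mirroring the (date, country, count) columns.
final case class Row(date: Int, country: String, count: Long)

object UnionThenSum {
  def mergeCounts(a: Seq[Row], b: Seq[Row]): Seq[Row] =
    (a ++ b)                               // step 1: keep every row from both inputs
      .groupBy(r => (r.date, r.country))   // step 2: bucket rows by (date, country)
      .map { case ((date, country), rows) => Row(date, country, rows.map(_.count).sum) }
      .toSeq
      .sortBy(r => (r.date, r.country))    // sorted only to make the result easy to read

  // The two tables from the question.
  val df1: Seq[Row] = Seq(Row(1872, "Scotland", 1), Row(1873, "England", 1),
    Row(1873, "Scotland", 1), Row(1875, "England", 1), Row(1875, "Scotland", 2))
  val df2: Seq[Row] = Seq(Row(1872, "England", 1), Row(1873, "Scotland", 1),
    Row(1874, "England", 1), Row(1875, "Scotland", 1), Row(1875, "Wales", 1))

  def main(args: Array[String]): Unit =
    mergeCounts(df1, df2).foreach(println)
}
```

Keys that appear in only one input, such as 1874 England, pass through with their original count, which is why a union fits here better than a join.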
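For Spark 2.0+, the same aggregation might look like the sketch below. It assumes a local SparkSession; the object name MergeCountsSpark2 and the helper mergeCounts are illustrative and do not come from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object MergeCountsSpark2 {
  def mergeCounts(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.union(df2) // matches columns by position, so both inputs need the same column order
      .groupBy("date", "country")
      .agg(sum("count").as("count"))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell, spark already exists.
    val spark = SparkSession.builder().appName("merge-counts").master("local[*]").getOrCreate()
    import spark.implicits._ // provides toDF on local Seqs

    val df1 = Seq((1872, "Scotland", 1), (1873, "England", 1), (1873, "Scotland", 1),
                  (1875, "England", 1), (1875, "Scotland", 2)).toDF("date", "country", "count")
    val df2 = Seq((1872, "England", 1), (1873, "Scotland", 1), (1874, "England", 1),
                  (1875, "Scotland", 1), (1875, "Wales", 1)).toDF("date", "country", "count")

    mergeCounts(df1, df2).orderBy("date", "country").show()
    spark.stop()
  }
}
```

Like unionAll, union keeps duplicate rows, which is what the sum needs here.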

Answer 1
2 accepted 2018-02-21 07:16:21

Answer 2
2 2018-02-21 07:27:39
