
Merging and aggregating dataframes using Spark Scala

After transforming a dataset using Spark Scala (1.6.2), I got the following two dataframes.

DF1:

| date| country | count|
| 1872| Scotland|     1|
| 1873| England |     1|
| 1873| Scotland|     1|
| 1875| England |     1|
| 1875| Scotland|     2|

DF2:

| date| country | count|
| 1872| England |     1|
| 1873| Scotland|     1|
| 1874| England |     1|
| 1875| Scotland|     1|
| 1875| Wales   |     1|

Now, from the two dataframes above, I want to aggregate the count by date per country, as in the output below. I tried using union and join, but was not able to get the desired results.

Expected output from the two dataframes above:

| date| country | count|
| 1872| England |     1|
| 1872| Scotland|     1|
| 1873| Scotland|     2|
| 1873| England |     1|
| 1874| England |     1|
| 1875| Scotland|     3|
| 1875| Wales   |     1|
| 1875| England |     1|

Kindly help me find a solution.

The best way is to perform a union and then a groupBy on the two columns; with sum, you can specify which column to add up:

df1.unionAll(df2)
   .groupBy("date", "country")
   .sum("count")

Output:

+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland|         1|
|1875| England|         1|
|1873| England|         1|
|1875|   Wales|         1|
|1872| England|         1|
|1874| England|         1|
|1873|Scotland|         2|
|1875|Scotland|         3|
+----+--------+----------+
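
Note that the summed column comes out named sum(count). If the result should exactly match the expected output (a column named count, ordered by date), one possible follow-up, sketched here, is to rename and sort afterwards:

df1.unionAll(df2)
   .groupBy("date", "country")
   .sum("count")
   .withColumnRenamed("sum(count)", "count")  // restore the original column name
   .orderBy("date")                           // order rows by date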

Using the DataFrame API, you can use a unionAll followed by a groupBy to achieve this.

import org.apache.spark.sql.functions.sum
import sqlContext.implicits._  // for the $"count" syntax; assumes an existing SQLContext named sqlContext

DF1.unionAll(DF2)
  .groupBy("date", "country")
  .agg(sum($"count").as("count"))

This will first put all rows from the two dataframes into a single dataframe. Then, by grouping on the date and country columns, it's possible to get the aggregate sum of the count column by date per country, as asked. The as("count") part renames the aggregated column to count.


Note: In newer Spark versions (2.0 and later), unionAll is deprecated and replaced by union.
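
For example, a minimal Spark 2.x equivalent of the same aggregation (a sketch, assuming the df1 and df2 DataFrames from above) would be:

import org.apache.spark.sql.functions.sum

df1.union(df2)                 // union replaces the deprecated unionAll
  .groupBy("date", "country")
  .agg(sum("count").as("count"))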
