
Merging and aggregating dataframes using Spark Scala

After transforming a dataset with Spark Scala (1.6.2), I have the following two dataframes.

DF1:

| date| country | count|
| 1872| Scotland|     1|
| 1873| England |     1|
| 1873| Scotland|     1|
| 1875| England |     1|
| 1875| Scotland|     2|

DF2:

| date| country | count|
| 1872| England |     1|
| 1873| Scotland|     1|
| 1874| England |     1|
| 1875| Scotland|     1|
| 1875| Wales   |     1|

From these two dataframes, I want the total count per date and country, as in the output below. I tried using a union and a join, but could not get the desired result.

Expected output from the two dataframes above:

| date| country | count|
| 1872| England |     1|
| 1872| Scotland|     1|
| 1873| Scotland|     2|
| 1873| England |     1|
| 1874| England |     1|
| 1875| Scotland|     3|
| 1875| Wales   |     1|
| 1875| England |     1|

Please help me find a solution.

The best way is to perform a union and then a groupBy on the two columns; with sum you can then specify which column to add up:

df1.unionAll(df2)
   .groupBy("date", "country")
   .sum("count")

Output:

+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland|         1|
|1875| England|         1|
|1873| England|         1|
|1875|   Wales|         1|
|1872| England|         1|
|1874| England|         1|
|1873|Scotland|         2|
|1875|Scotland|         3|
+----+--------+----------+

Using the DataFrame API, you can use a unionAll followed by a groupBy to achieve this.

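import org.apache.spark.sql.functions.sum
import sqlContext.implicits._  // enables $"count"; sqlContext is the spark-shell default name

DF1.unionAll(DF2)
  .groupBy("date", "country")
  .agg(sum($"count").as("count"))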

This first puts all rows from the two dataframes into a single dataframe. Grouping on the date and country columns then gives the sum of the count column per date and country, as asked. The as("count") part renames the aggregated column to count.
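To check the union-then-group logic without a Spark installation, here is a minimal sketch on plain Scala collections. It is an analogy, not Spark code: the object name UnionGroupSumSketch, the Row alias and the aggregate helper are made up for illustration, ++ stands in for unionAll, and the rows are copied from DF1 and DF2 in the question.

```scala
// Plain Scala collections standing in for the two DataFrames (no Spark required).
// Each tuple is (date, country, count), copied from DF1 and DF2 in the question.
object UnionGroupSumSketch {
  type Row = (Int, String, Int)

  val df1: Seq[Row] = Seq(
    (1872, "Scotland", 1), (1873, "England", 1), (1873, "Scotland", 1),
    (1875, "England", 1), (1875, "Scotland", 2))

  val df2: Seq[Row] = Seq(
    (1872, "England", 1), (1873, "Scotland", 1), (1874, "England", 1),
    (1875, "Scotland", 1), (1875, "Wales", 1))

  // Concatenate the rows (the role unionAll plays), group them by
  // (date, country), then sum the counts inside each group.
  def aggregate(a: Seq[Row], b: Seq[Row]): Seq[Row] =
    (a ++ b)
      .groupBy { case (date, country, _) => (date, country) }
      .toSeq
      .map { case ((date, country), rows) => (date, country, rows.map(_._3).sum) }
      .sorted // sort by date, then country, for a stable display order

  def main(args: Array[String]): Unit =
    aggregate(df1, df2).foreach(println)
}
```

The sketch sorts its result only to make it easy to compare with the expected table. The Spark version returns groups in no particular order, so an orderBy on date and country would be needed there for the same effect.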


Note: In newer Spark versions (2.0+), unionAll is deprecated and replaced by union.

