
Scala/Spark - Find the total of a value across rows based on a key

I have a large text file which contains the page views of some Wikimedia projects. (You can find it here if you're really interested.) Each line contains the space-delimited statistics for one Wikimedia page. The schema looks as follows: <project code> <page title> <num hits> <page size>

In Scala, using Spark RDDs or DataFrames, I wish to compute the total number of hits for each project, based on the project code. So, for example, for projects with the code "zw", I would like to find all the rows that begin with that project code and add up their hits. Obviously this should be done for all project codes simultaneously.

I have looked at functions like aggregateByKey, etc., but the examples I found don't go into enough detail, especially for a file with 4 fields. I imagine it's some kind of MapReduce job, but exactly how to implement it is beyond me.

Any help would be greatly appreciated.

First, read the file in as a Dataset[String]. Then parse each line into a tuple so that it can easily be converted to a DataFrame. Once you have a DataFrame, a simple .groupBy().agg() is enough to finish the computation.

import org.apache.spark.sql.functions.sum
import spark.implicits._ // for .toDF, the $"" column syntax, and tuple encoders

// Read the raw lines, then parse each one into (project_code, num_hits).
// Fields are space-delimited: <project code> <page title> <num hits> <page size>.
val df = spark.read.textFile("/tmp/pagecounts.gz").map { l =>
  val a = l.split(" ")
  (a(0), a(2).toLong) // keep only the project code and the hit count
}.toDF("project_code", "num_hits")

// Sum the hits per project code, highest totals first.
val agg_df = df.groupBy("project_code")
  .agg(sum("num_hits").as("total_hits"))
  .orderBy($"total_hits".desc)

agg_df.show(10)

The above snippet shows the top 10 project codes by total hits.

+------------+----------+
|project_code|total_hits|
+------------+----------+
|       en.mw|   5466346|
|          en|   5310694|
|       es.mw|    695531|
|       ja.mw|    611443|
|       de.mw|    572119|
|       fr.mw|    536978|
|       ru.mw|    466742|
|          ru|    463437|
|          es|    400632|
|       it.mw|    400297|
+------------+----------+
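One practical caveat: if the dump can contain malformed lines (too few fields, or a non-numeric hit count), the .toLong above will throw when the job runs. A minimal defensive variant of the parsing step, sketched here under the assumption of the same file path and the imports from the snippet above, simply drops such lines:

import scala.util.Try

// Defensive parse: lines with fewer than 3 fields, or a hit count that
// fails to parse as a Long, are silently dropped instead of crashing.
val safe_df = spark.read.textFile("/tmp/pagecounts.gz")
  .flatMap { l =>
    val a = l.split(" ")
    if (a.length >= 3) Try((a(0), a(2).toLong)).toOption else None
  }
  .toDF("project_code", "num_hits")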

It is certainly also possible to do this with the older RDD API as a map/reduce job, but you lose many of the optimizations that the Dataset/DataFrame API brings.
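For completeness, here is a sketch of that RDD version, assuming the same file path and a spark-shell style spark session. It uses reduceByKey rather than the aggregateByKey mentioned in the question; a plain sum doesn't need the latter's separate seed and merge functions.

// Parse each line into (project_code, num_hits) and sum per key.
val totals = spark.sparkContext
  .textFile("/tmp/pagecounts.gz")
  .map(_.split(" "))
  .filter(_.length >= 3)             // skip malformed lines
  .map(a => (a(0), a(2).toLong))     // (project_code, num_hits)
  .reduceByKey(_ + _)                // sum hits per project code

// Top 10 by total hits, mirroring agg_df.show(10) above.
totals.top(10)(Ordering.by(_._2)).foreach(println)

Note that top(10) avoids a full sort of the RDD; it only keeps the 10 largest elements per partition before merging them at the driver.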
