
Scala/Spark - Find the total of a value across rows based on a key

I have a large text file which contains the page views of some Wikimedia projects. (You can find it here if you're really interested.) Each line contains the space-delimited statistics for one Wikimedia page. The schema looks as follows: <project code> <page title> <num hits> <page size>

In Scala, using Spark RDDs or DataFrames, I wish to compute the total number of hits for each project, based on the project code. So, for example, for projects with the code "zw", I would like to find all the rows that begin with that project code and add up their hits. Obviously this should be done for all project codes simultaneously.

I have looked at functions like aggregateByKey, etc., but the examples I found don't go into enough detail, especially for a file with 4 fields. I imagine it's some kind of MapReduce job, but exactly how to implement it is beyond me.

Any help would be greatly appreciated.

First, read the file in as a Dataset[String]. Then parse each line into a tuple so that it can easily be converted to a DataFrame. Once you have a DataFrame, a simple .groupBy().agg() is enough to finish the computation.

import org.apache.spark.sql.functions.sum
import spark.implicits._ // for .toDF, the $"" column syntax, and tuple encoders

// Read the raw lines, then parse each one into (project_code, num_hits).
// Fields are space-delimited: <project code> <page title> <num hits> <page size>.
val df = spark.read.textFile("/tmp/pagecounts.gz").map { l =>
  val a = l.split(" ")
  (a(0), a(2).toLong) // keep only the project code and the hit count
}.toDF("project_code", "num_hits")

// Sum the hits per project code, highest totals first.
val agg_df = df.groupBy("project_code")
  .agg(sum("num_hits").as("total_hits"))
  .orderBy($"total_hits".desc)

agg_df.show(10)

The above snippet shows the top 10 project codes by total hits.

+------------+----------+
|project_code|total_hits|
+------------+----------+
|       en.mw|   5466346|
|          en|   5310694|
|       es.mw|    695531|
|       ja.mw|    611443|
|       de.mw|    572119|
|       fr.mw|    536978|
|       ru.mw|    466742|
|          ru|    463437|
|          es|    400632|
|       it.mw|    400297|
+------------+----------+
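One practical caveat: if the dump can contain malformed lines (too few fields, or a non-numeric hit count), the .toLong above will throw when the job runs. A minimal defensive variant of the parsing step, sketched here under the assumption of the same file path and the imports from the snippet above, simply drops such lines:

import scala.util.Try

// Defensive parse: lines with fewer than 3 fields, or a hit count that
// fails to parse as a Long, are silently dropped instead of crashing.
val safe_df = spark.read.textFile("/tmp/pagecounts.gz")
  .flatMap { l =>
    val a = l.split(" ")
    if (a.length >= 3) Try((a(0), a(2).toLong)).toOption else None
  }
  .toDF("project_code", "num_hits")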

It is certainly also possible to do this with the older RDD API as a map/reduce job, but you lose many of the optimizations that the Dataset/DataFrame API brings.
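For completeness, here is a sketch of that RDD version, assuming the same file path and a spark-shell style spark session. It uses reduceByKey rather than the aggregateByKey mentioned in the question; a plain sum doesn't need the latter's separate seed and merge functions.

// Parse each line into (project_code, num_hits) and sum per key.
val totals = spark.sparkContext
  .textFile("/tmp/pagecounts.gz")
  .map(_.split(" "))
  .filter(_.length >= 3)             // skip malformed lines
  .map(a => (a(0), a(2).toLong))     // (project_code, num_hits)
  .reduceByKey(_ + _)                // sum hits per project code

// Top 10 by total hits, mirroring agg_df.show(10) above.
totals.top(10)(Ordering.by(_._2)).foreach(println)

Note that top(10) avoids a full sort of the RDD; it only keeps the 10 largest elements per partition before merging them at the driver.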
