
Spark - How to word count without RDD

It looks like the RDD API is going to be removed from Spark.

Announcement: DataFrame-based API is primary API

The RDD-based API is expected to be removed in Spark 3.0

Then, how to implement programs like word count in Spark?

The data you manipulate as tuples with the RDD API can be thought of, and manipulated, as columns/fields in a SQL-like manner with the DataFrame API.

import org.apache.spark.sql.functions.{col, explode, split}

// df is a DataFrame with a string column "lines", one line of text per row
df.withColumn("word", explode(split(col("lines"), " ")))
  .groupBy("word")
  .count()
  .orderBy(col("count").desc())
  .show()
+---------+-----+
|     word|count|
+---------+-----+
|      foo|    5|
|      bar|    2|
|     toto|    1|
...
+---------+-----+

Notes:

  • This code snippet requires the necessary imports from org.apache.spark.sql.functions
  • Relevant examples can be found in this question's answers.
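For completeness, the snippet above can be wrapped into a minimal end-to-end program. This is a sketch, not the original answer's code: the local `SparkSession` configuration and the input path `input.txt` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a cluster deployment would configure this differently
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // spark.read.text yields a single-column DataFrame ("value");
    // rename the column to "lines" to match the snippet above
    val df = spark.read.text("input.txt").toDF("lines")

    df.withColumn("word", explode(split(col("lines"), " ")))
      .groupBy("word")
      .count()
      .orderBy(col("count").desc())
      .show()

    spark.stop()
  }
}
```

Run it with spark-submit (or from an IDE with Spark on the classpath); no RDD calls are needed anywhere.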

The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
