
Get distinct words in a Spark DataFrame column

I have a DataFrame like this:

val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
    )
).toDF("num", "words")

and I would like to get the distinct words in this column, like:

val vocab = List("this", "is", "a", "sentence", "And", "another")

What is a Scala/Spark-esque way of achieving this?

PS: I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.

Here is a very silly answer:

import spark.implicits._

df2
  .as[(Int, String)]                               // view each row as a typed (num, words) tuple
  .flatMap { case (_, words) => words.split(' ') } // one output row per word
  .distinct
  .show(false)

I think this is what you want?

+--------+
|value   |
+--------+
|sentence|
|this    |
|is      |
|a       |
|And     |
|another |
+--------+

Or were you more after a single row that contains all the distinct words?
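For completeness, here is a sketch of the same thing using the untyped DataFrame API instead of a typed Dataset, assuming the same `df2` and an active `SparkSession`. `split` and `explode` turn each sentence into one row per word, and `collect_set` aggregates the distinct words into a single array if you want them all in one row (note that `collect_set` does not guarantee any particular order):

```scala
import org.apache.spark.sql.functions.{col, explode, split, collect_set}

// One row per distinct word, via the DataFrame API.
val vocabDF = df2
  .select(explode(split(col("words"), " ")).as("word"))
  .distinct()

vocabDF.show(false)

// Or collect all distinct words into a single local Seq[String];
// collect_set deduplicates while aggregating.
val vocab: Seq[String] = df2
  .select(explode(split(col("words"), " ")).as("word"))
  .agg(collect_set("word"))
  .first()
  .getSeq[String](0)
```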

(also this is my first ever stack overflow answer so pls be nice <3)
