
Get distinct words in a Spark DataFrame column

I have a DataFrame like this:

val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
    )
).toDF("num", "words")

and I would like to get the distinct words in this column, like:

val vocab = List("this", "is", "a", "sentence", "And", "another")

What is a Scala/Spark-esque way of achieving this?

PS: I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.

Here is a very silly answer:

import spark.implicits._

df2
  .as[(Int, String)]                               // view each row as a typed (num, words) tuple
  .flatMap { case (_, words) => words.split(' ') } // one output row per word
  .distinct
  .show(false)

I think this is what you want?

+--------+
|value   |
+--------+
|sentence|
|this    |
|is      |
|a       |
|And     |
|another |
+--------+

Or were you more after a single row that contains all the distinct words?
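For completeness, here is a sketch of the same thing using the untyped DataFrame API instead of a typed Dataset, assuming the same `df2` and an active `SparkSession`. `split` and `explode` turn each sentence into one row per word, and `collect_set` aggregates the distinct words into a single array if you want them all in one row (note that `collect_set` does not guarantee any particular order):

```scala
import org.apache.spark.sql.functions.{col, explode, split, collect_set}

// One row per distinct word, via the DataFrame API.
val vocabDF = df2
  .select(explode(split(col("words"), " ")).as("word"))
  .distinct()

vocabDF.show(false)

// Or collect all distinct words into a single local Seq[String];
// collect_set deduplicates while aggregating.
val vocab: Seq[String] = df2
  .select(explode(split(col("words"), " ")).as("word"))
  .agg(collect_set("word"))
  .first()
  .getSeq[String](0)
```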

(also this is my first ever stack overflow answer so pls be nice <3)
