Getting the distinct words in a Spark DataFrame column

Question

I have a DataFrame like this:

val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
  )
).toDF("num", "words")

I want to get the distinct words in the "words" column, for example:

val vocab = List("this", "is", "a", "sentence", "And", "another")

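What is the Scala/Spark-esque way of achieving this?

PS: I know I could solve this with for loops and the like, but I am trying to get better at functional programming, and more specifically at Spark and Scala.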

Answer 1

This is a pretty naive answer:

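import spark.implicits._ // encoders needed for .as[...] and flatMap

df2
  .as[(Int, String)]                                // typed Dataset of (num, words)
  .flatMap { case (_, words) => words.split(' ') }  // one row per word
  .distinct                                         // drop duplicate words
  .show(false)                                      // print without truncating values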

I think this is what you are after?

+--------+
|value   |
+--------+
|sentence|
|this    |
|is      |
|a       |
|And     |
|another |
+--------+
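If a local Scala List like the `vocab` in the question is needed rather than a printed table, the distinct Dataset can be collected to the driver. The following is a minimal sketch, assuming the vocabulary is small enough to fit in driver memory and that a local SparkSession is acceptable (in spark-shell, `spark` already exists and the builder line can be dropped). The app name is an arbitrary placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; reuse the existing `spark` if there is one.
val spark = SparkSession.builder().master("local[*]").appName("vocab-list").getOrCreate()
import spark.implicits._

val df2 = Seq(
  (0, "this is a sentence"),
  (1, "And another sentence")
).toDF("num", "words")

// Split each sentence into words, deduplicate on the cluster,
// then bring the (small) result back to the driver as a List.
val vocab: List[String] = df2
  .as[(Int, String)]
  .flatMap { case (_, words) => words.split(' ') }
  .distinct
  .collect()
  .toList
```

Note that `distinct` does not guarantee any particular ordering, so the words may come back in a different order than in the question's example.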

Or would you prefer a single row containing all the distinct words?
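For the single-row variant, one possible sketch uses the built-in column functions `split`, `explode` and `collect_set` instead of the typed `flatMap`. It assumes words are separated by a single space, as in the sample data. The session setup and the names `singleRow`, `word` and `vocab` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, explode, split}

// Assumption: a local session for illustration; reuse the existing `spark` if there is one.
val spark = SparkSession.builder().master("local[*]").appName("vocab-row").getOrCreate()
import spark.implicits._

val df2 = Seq(
  (0, "this is a sentence"),
  (1, "And another sentence")
).toDF("num", "words")

val singleRow = df2
  // split each sentence on spaces and emit one row per word
  .select(explode(split(col("words"), " ")).as("word"))
  // aggregate all rows into one array column without duplicates
  .agg(collect_set("word").as("vocab"))

singleRow.show(false)
```

`collect_set` removes duplicates but does not define an element order, so the array may list the words in any order.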

(This is also my first Stack Overflow answer, so please be kind <3)