Get distinct words in a Spark DataFrame column
I have a df like this:
val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
  )
).toDF("num", "words")
and I would like to get the distinct words in this column, like:
val vocab = List("this", "is", "a", "sentence", "And", "another")
What is a Scala/Spark-esque way of achieving this?
PS I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.
Here is a very silly answer:
import spark.implicits._

df2
  .as[(Int, String)]
  .flatMap { case (_, words) => words.split(' ') }
  .distinct
  .show(false)
I think this is what you want?
+--------+
|value |
+--------+
|sentence|
|this |
|is |
|a |
|And |
|another |
+--------+
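The same result can also be reached without dropping to the typed Dataset API, using Spark's built-in `split` and `explode` column functions (a sketch, assuming the `df2` from the question and the `spark.implicits._` import for the `$` column syntax):

```scala
import org.apache.spark.sql.functions.{explode, split}

// Split each sentence on spaces into an array column, flatten it
// into one word per row with explode, then keep only distinct words.
df2
  .select(explode(split($"words", " ")).as("word"))
  .distinct()
  .show(false)
```

This stays entirely in the untyped DataFrame API, which lets Catalyst optimize the whole pipeline without deserializing rows into Scala tuples.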
Or were you more after a single row that contains all the distinct words?
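If a single row is what you want, one sketch (again assuming `df2` and `spark.implicits._` are in scope) is to aggregate the exploded words with `collect_set`, which gathers the distinct values into one array column:

```scala
import org.apache.spark.sql.functions.{collect_set, explode, split}

// One word per row, then collect the distinct words into a single
// array — yielding one row holding the whole vocabulary.
df2
  .select(explode(split($"words", " ")).as("word"))
  .agg(collect_set($"word").as("vocab"))
  .show(false)
```

Note that `collect_set` makes no ordering guarantee, so the array may come back in any order.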
(also this is my first ever Stack Overflow answer so pls be nice <3)