Get distinct words in a Spark DataFrame column
I have a df like this:
val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
  )
).toDF("num", "words")
and I would like to get the distinct words in this column, like:
val vocab = List("this", "is", "a", "sentence", "And", "another")
What is a Scala/Spark-esque way of achieving this?
PS I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.
Here is a very silly answer:
import spark.implicits._

df2
  .as[(Int, String)]
  .flatMap { case (_, words) => words.split(' ') }
  .distinct
  .show(false)
I think this is what you want?
+--------+
|value |
+--------+
|sentence|
|this |
|is |
|a |
|And |
|another |
+--------+
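The same result can also be reached without dropping to the typed Dataset API, using Spark's built-in `split` and `explode` column functions (a sketch, assuming the `df2` from the question and the `spark.implicits._` import for the `$` column syntax):

```scala
import org.apache.spark.sql.functions.{explode, split}

// Split each sentence on spaces into an array column, flatten it
// into one word per row with explode, then keep only distinct words.
df2
  .select(explode(split($"words", " ")).as("word"))
  .distinct()
  .show(false)
```

This stays entirely in the untyped DataFrame API, which lets Catalyst optimize the whole pipeline without deserializing rows into Scala tuples.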
Or were you more after a single row that contains all the distinct words?
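If a single row is what you want, one sketch (again assuming `df2` and `spark.implicits._` are in scope) is to aggregate the exploded words with `collect_set`, which gathers the distinct values into one array column:

```scala
import org.apache.spark.sql.functions.{collect_set, explode, split}

// One word per row, then collect the distinct words into a single
// array — yielding one row holding the whole vocabulary.
df2
  .select(explode(split($"words", " ")).as("word"))
  .agg(collect_set($"word").as("vocab"))
  .show(false)
```

Note that `collect_set` makes no ordering guarantee, so the array may come back in any order.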
(also this is my first ever Stack Overflow answer so pls be nice <3)