
Get distinct words in a Spark DataFrame column

I have a DataFrame like this:

val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
    )
).toDF("num", "words")

and I would like to get the distinct words in this column, like:

val vocab = List("this", "is", "a", "sentence", "And", "another")

What is a Scala/Spark-esque way of achieving this?

PS: I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.

Here is a very silly answer:

import spark.implicits._

df2
  .as[(Int, String)]
  .flatMap { case (_, words) => words.split(' ') }
  .distinct
  .show(false)

I think this is what you want?

+--------+
|value   |
+--------+
|sentence|
|this    |
|is      |
|a       |
|And     |
|another |
+--------+

Or were you more after a single row that contains all the distinct words?
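If a single row, or a local List like the `vocab` in the question, is what you are after, something along these lines might work. This is only a sketch: it assumes a local SparkSession (in spark-shell, `spark` already exists and that line can be skipped), the column aliases `word` and `vocab` are names I made up for illustration, and it splits on runs of whitespace with the regex `\\s+` rather than on a single space character as above. Neither result has a guaranteed word order.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, explode, split}

// Assumption: a local session for experimenting; spark-shell provides `spark` already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-words")
  .getOrCreate()
import spark.implicits._

// Same data as df2 in the question.
val df2 = Seq(
  (0, "this is a sentence"),
  (1, "And another sentence")
).toDF("num", "words")

// One row per word: split each sentence on whitespace, then explode the array.
val words = df2.select(explode(split(col("words"), "\\s+")).as("word"))

// Option 1: a single row holding an array of the distinct words.
val singleRow = words.agg(collect_set(col("word")).as("vocab"))

// Option 2: pull the distinct words back to the driver as a Scala List.
val vocab: List[String] = words.distinct().as[String].collect().toList
```

Note that `collect()` brings everything to the driver, so Option 2 only makes sense when the vocabulary is small enough to fit in memory there.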

(Also, this is my first ever Stack Overflow answer, so pls be nice <3)

