Converting a Spark DataFrame column to List[String] in Scala

Question

I am working on the MovieLens data set. In one of the CSV files, the data is structured as:

movieId movieTitle genres

where genres is itself a list of |-separated values, and the field is nullable.

I am trying to get a unique list of all the genres so that I can rearrange the data as follows:

movieId movieTitle genre1 genre2 ... genreN

and a row whose genres value is genre1|genre2 will look like:

1 Title1 1 1 0 ... 0

So far, I have been able to read the CSV file using the following code:

val conf         = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context      = new SparkContext(conf)
val sparkSession = SparkSession.builder()
                   .appName(App.name)
                   .config("header", "true")
                   .config(conf = conf)
                   .getOrCreate()

val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)

If I try something like:

movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()

then I get the following exception:

java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1

I also tried providing the schema explicitly, using the following code:

val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))

and tried:

val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)

but I got the same exception.

Is there any way I can get the set of genres as a List or a Set, so that I can further massage the data into the desired format? Any help will be appreciated.

Answer 1

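Here is how I got the set of genres:

// Collect the genres column to the driver; each Row holds one (possibly null) string.
val genreList: Array[String] =
  for (row <- movieFrame.select("genres").collect) yield row.getString(0)

// Split each non-null cell on the literal '|' and flatten the pieces.
val genres: Array[String] = for {
  g     <- genreList if g != null
  genre <- g.split("\\|")
} yield genre

val genreSet: Set[String] = genres.toSet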
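
Once the genres are on the driver, the layout asked for in the question (one 0/1 column per genre) can be worked out with plain Scala collections. The following is only a minimal sketch under stated assumptions: the object `GenreEncoding` and its methods are hypothetical names, not part of the original post or of Spark, and the trimming of whitespace around each genre is an added assumption.

```scala
// Pure-Scala sketch; GenreEncoding and its methods are hypothetical names.
object GenreEncoding {

  /** Splits a raw genres cell such as "Comedy|Drama".
    * A null or blank cell yields no genres (the field is nullable). */
  def parseGenres(raw: String): List[String] =
    Option(raw).toList
      .flatMap(_.split("\\|").toList) // "\\|" matches a literal pipe
      .map(_.trim)                    // assumption: stray spaces are not significant
      .filter(_.nonEmpty)

  /** Sorted, de-duplicated genres across all cells; this fixes the column order. */
  def uniqueGenres(rawCells: Seq[String]): List[String] =
    rawCells.toList.flatMap(parseGenres).distinct.sorted

  /** One 0/1 flag per known genre, in the same order as `genres`. */
  def indicators(raw: String, genres: Seq[String]): Seq[Int] = {
    val present = parseGenres(raw).toSet
    genres.map(g => if (present(g)) 1 else 0)
  }
}
```

With the `genreList` array from the answer above, `GenreEncoding.uniqueGenres(genreList)` would give the column order, and `GenreEncoding.indicators(cell, columns)` the flags for a single row.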

Answer 2

This gave an Array[Array[String]] (note that a null genres value would throw a NullPointerException here):

    val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").distinct).collect()

To get an Array[String], flatten it. The per-row distinct does not remove duplicates across rows, so apply distinct again:

    val genres = genreLst.flatten.distinct

or, in one expression:

    val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|")).collect().flatten.distinct
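
As an alternative to mapping over the RDD, the same result can be sketched with Spark's built-in column functions `split` and `explode`. Because no user-written lambda runs on the executors, this may also sidestep the `ClassNotFoundException` from the question, although that is an assumption about its cause (the application's classes not being available to the executors) rather than something the post confirms. The object and method names below are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

// Hypothetical helper, not from the original post.
object GenresWithBuiltins {

  /** Distinct genres from a DataFrame that has a string column named "genres". */
  def distinctGenres(movies: DataFrame): Array[String] =
    movies
      // split yields an array column; explode emits one row per element
      // and produces no rows for a null array, so nullable cells are skipped.
      .select(explode(split(col("genres"), "\\|")).as("genre"))
      .distinct()
      .collect()
      .map(_.getString(0))
}
```

Usage would be `GenresWithBuiltins.distinctGenres(movieFrame).toSet`. The column name "genres" assumes the explicit schema from the question; if the file instead has a header row, the reader option is `sparkSession.read.option("header", "true").csv(moviesPath)`.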

Answer 1: score 1, accepted, 2016-10-18 02:32:22

Answer 2: score -1, 2016-12-27 20:46:39
