Converting a Spark DataFrame column to List[String] in Scala
I am working with the MovieLens data set. In one of the csv files, the data is structured as:
movieId    movieTitle    genres
where genres is itself a list of |-separated values, and the field is nullable.
I am trying to get a unique list of all the genres so that I can rearrange the data as follows:
movieId    movieTitle    genre1    genre2    ...    genreN
and a row whose genres value is genre1 | genre2 will look like:
1    Title1    1    1    0    ...    0
So far, I have been able to read the csv file using the following code:
val conf = new SparkConf().setAppName(App.name).setMaster(App.sparkMaster)
val context = new SparkContext(conf)
val sparkSession = SparkSession.builder()
  .appName(App.name)
  .config("header", "true")
  .config(conf = conf)
  .getOrCreate()
val movieFrame: DataFrame = sparkSession.read.csv(moviesPath)
If I try something like:
movieFrame.rdd.map(row ⇒ row(2).asInstanceOf[String]).collect()
Then I get the following exception:
java.lang.ClassNotFoundException: com.github.babbupandey.ReadData$$anonfun$1
I also tried providing the schema explicitly using the following code:
val moviesSchema: StructType = StructType(Array(
  StructField("movieId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("genres", StringType, nullable = true)))
and tried:
val movieFrame: DataFrame = sparkSession.read.schema(moviesSchema).csv(moviesPath)
and then I got the same exception.
Is there any way in which I can get the set of genres as a List or a Set, so I can further massage the data into the desired format? Any help will be appreciated.
Here is how I got the set of genres:
val genreList: Array[String] = for (row <- movieFrame.select("genres").collect) yield row.getString(0)
val genres: Array[String] = for {
  g <- genreList
  genre <- g.split("\\|")
} yield genre
val genreSet : Set[String] = genres.toSet
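With genreSet in hand, the 0/1 layout described in the question could be built along these lines. This is only a sketch using plain Scala collections, with no Spark involved; GenreEncoding, splitGenres and oneHot are hypothetical names that do not appear in the post, and a null genres value is assumed to mean "no genres".

```scala
object GenreEncoding {
  // Split a raw "a|b|c" field into trimmed genre names.
  // Option(raw) turns a null field (the column is nullable) into an empty list.
  def splitGenres(raw: String): Seq[String] =
    Option(raw).toList
      .flatMap(_.split("\\|").toList)
      .map(_.trim)
      .filter(_.nonEmpty)

  // For a fixed column order, emit 1 where the row has that genre and 0 otherwise.
  def oneHot(allGenres: Seq[String], raw: String): Seq[Int] = {
    val present = splitGenres(raw).toSet
    allGenres.map(g => if (present.contains(g)) 1 else 0)
  }
}
```

A Set has no guaranteed ordering, so something like genreSet.toSeq.sorted would be needed to get a stable column order before calling oneHot. With the columns Seq("Action", "Comedy", "Drama"), a row whose genres value is "Action|Comedy" encodes to 1, 1, 0.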
This worked to give an Array[Array[String]]:
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect()
To get an Array[String]:
val genres = genreLst.flatten
or, in a single expression:
val genreLst = movieFrame.select("genres").rdd.map(r => r(0).asInstanceOf[String].split("\\|").map(_.toString).distinct).collect().flatten
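Two caveats apply to the snippets above: calling split on a null genres value throws a NullPointerException, and distinct is applied per row, so the flattened array can still contain the same genre many times. A sketch that lets Spark do the splitting and de-duplication instead, assuming Spark 2.x or later on the classpath; GenreColumns and distinctGenres are hypothetical names, not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, Encoders}
import org.apache.spark.sql.functions.{col, explode, split}

object GenreColumns {
  // Returns the distinct genre names found in the "genres" column of df.
  def distinctGenres(df: DataFrame): Set[String] =
    df.select(explode(split(col("genres"), "\\|")).as("genre")) // one row per genre; a null field yields no rows
      .distinct()                                                // de-duplicate across all rows in Spark
      .as[String](Encoders.STRING)                               // typed Dataset[String]
      .collect()
      .toSet
}
```

It would be called as GenreColumns.distinctGenres(movieFrame). If the csv file has a header row and it is not skipped with .option("header", "true") on the reader, the header text itself will show up as one of the genres.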