
Fetch all values irrespective of keys from a column of JSON type in a Spark dataframe using Spark with scala

I was trying to load a TSV file containing some metadata on movies using Spark. This TSV file contains genre info on the movies in JSON format [the last column in every row].

Sample File

975900  /m/03vyhn   Ghosts of Mars  2001-08-24  14010832    98.0    {"/m/02h40lc": "English Language"}  {"/m/09c7w0": "United States of America"}   {"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}
3196793 /m/08yl5d   Getting Away with Murder: The JonBenét Ramsey Mystery   2000-02-16      95.0    {"/m/02h40lc": "English Language"}  {"/m/09c7w0": "United States of America"}   {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}

I tried the code below, which lets me access a particular value from the genre JSON:

val ss = SessionCreator.createSession("DataCleaning", "local[*]")//helper function that creates a Spark session and returns it
val headerInfoRb = ResourceBundle.getBundle("conf.headerInfo")
val movieDF = DataReader.readFromTsv(ss, "D:/Utility/Datasets/MovieSummaries/movie.metadata.tsv")
                .toDF(headerInfoRb.getString("metadataReader").split(',').toSeq:_*)//DataReader.readFromTsv is a helper that takes a SparkSession and a file path, and uses the session's read function to return a dataframe
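
For context, SessionCreator.createSession and DataReader.readFromTsv are my own helpers; a minimal sketch of roughly what they do (simplified, assuming they just wrap SparkSession.builder and spark.read with a tab separator):

import org.apache.spark.sql.{DataFrame, SparkSession}

object SessionCreator {
  // builds (or reuses) a Spark session for the given app name and master
  def createSession(appName: String, master: String): SparkSession =
    SparkSession.builder().appName(appName).master(master).getOrCreate()
}

object DataReader {
  // reads a headerless tab-separated file into a dataframe
  def readFromTsv(ss: SparkSession, path: String): DataFrame =
    ss.read.option("sep", "\t").csv(path)
}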

movieDF.select("wiki_mv_id","mv_nm","mv_genre")
                .withColumn("genre_frmttd", get_json_object(col("mv_genre"), "$./m/02kdv5l"))//extracts the value for one hard-coded key only
                .show(1,false)

Output

+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|wiki_mv_id|mv_nm         |mv_genre                                                                                                                                                                                  |genre_frmttd|
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|975900    |Ghosts of Mars|{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}|Action      |
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
only showing top 1 row

I want the genre_frmttd column formatted as shown below for every row in the data frame [the snippet below is for the first sample row]:

[Thriller, Science Fiction, Horror, Adventure, Supernatural, Action, Space western]

I am a bit of a rookie in Scala and Spark, so please suggest a way to list out the values.

  1. parse the JSON using from_json
  2. cast it to MapType(StringType, StringType)
  3. extract only the values using map_values

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

movieDF.select("wiki_mv_id","mv_nm","mv_genre")
      //parse the JSON string into a Map[String, String], then keep only its values
      .withColumn("genre_frmttd",map_values(from_json(col("mv_genre"),MapType(StringType, StringType))))
      .show(1,false)
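
map_values yields an array column, which show renders as [Thriller, Science Fiction, ...]. If you would rather have a single delimited string than an array, the result can additionally be flattened with array_join (available since Spark 2.4); an optional variation on the same query:

movieDF.select("wiki_mv_id","mv_nm","mv_genre")
      //join the array of genre names into one comma-separated string
      .withColumn("genre_frmttd",
        array_join(map_values(from_json(col("mv_genre"),MapType(StringType, StringType))), ", "))
      .show(1,false)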
