
Spark 2.0 (not 2.1) Dataset[Row] or Dataframe - Select few columns to JSON

I have a Spark Dataframe with 10 columns and I need to store this in Postgres / an RDBMS. The target table has 7 columns, and the 7th column takes text (in JSON format) for further processing.

How do I select 6 columns and convert the remaining 4 columns in the DF to JSON format?

If the whole DF is to be stored as JSON, then we could use DF.write.format("json"), but only the last 4 columns are required to be in JSON format.

I tried creating a UDF (with either the Jackson or Lift lib), but was not successful in sending the 4 columns to the UDF.

For the JSON, the DF column name is the key and the DF column's value is the value.

e.g.:

dataset name: ds_base
root
 |-- bill_id: string (nullable = true)
 |-- trans_id: integer (nullable = true)
 |-- billing_id: decimal(3,-10) (nullable = true)
 |-- asset_id: string (nullable = true)
 |-- row_id: string (nullable = true)
 |-- created: string (nullable = true)
 |-- end_dt: string (nullable = true)
 |-- start_dt: string (nullable = true)
 |-- status_cd: string (nullable = true)
 |-- update_start_dt: string (nullable = true)

I want to do,
ds_base
 .select ( $"bill_id",
    $"trans_id",
    $"billing_id",
    $"asset_id",
    $"row_id",
    $"created",
    ?? <JSON format of 4 remaining columns>
    )

You can use struct and to_json:

import org.apache.spark.sql.functions.{to_json, struct}

to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt"))
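
For example, applied to the question's ds_base, the full select could look roughly like this (a sketch: to_json is only available from Spark 2.1 onwards, which is why the workaround below exists, and the json_data alias is just an illustrative name):

import org.apache.spark.sql.functions.{to_json, struct}
// assumes the usual import spark.implicits._ for the $"..." syntax

ds_base.select(
  $"bill_id",
  $"trans_id",
  $"billing_id",
  $"asset_id",
  $"row_id",
  $"created",
  // pack the remaining 4 columns into a struct and serialize it to a JSON string
  to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt")).alias("json_data")
)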

As a workaround for legacy Spark versions, you could convert the whole object to JSON and then extract the required fields:

import org.apache.spark.sql.functions.{col, get_json_object, struct}

// List of column names to be kept as-is
val scalarColumns: Seq[String] = Seq("bill_id", "trans_id", ...)
// List of column names to be put in JSON
val jsonColumns: Seq[String] = Seq(
  "end_dt", "start_dt", "status_cd", "update_start_dt"
)

// Convert all records to JSON, keeping selected fields as a nested document
val json = df.select(
  scalarColumns.map(col _) :+ 
  struct(jsonColumns map col: _*).alias("json"): _*
).toJSON

json.select(
  // `json` is a Dataset[String]; its single column is named "value"
  // Extract selected columns from the JSON field and cast them back to the required types
  scalarColumns.map(c =>
    get_json_object($"value", s"$$.$c").cast(df.schema(c).dataType).alias(c)) :+
  // Extract the nested JSON document as a string
  get_json_object($"value", "$.json").alias("json"): _*
)

This will work only as long as you have atomic types. Alternatively, you could use the standard JSON reader and specify the schema for the JSON field:

import org.apache.spark.sql.types._

val combined = df.select(
  scalarColumns.map(col _) :+ 
  struct(jsonColumns map col: _*).alias("json"): _*
)

val newSchema = StructType(combined.schema.fields map {
   case StructField("json", _, _, _) => StructField("json", StringType)
   case s => s
})

spark.read.schema(newSchema).json(combined.toJSON.rdd)
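
Since the question ultimately needs this stored in Postgres, the resulting DataFrame could then be written out over JDBC along these lines (a sketch only: the connection URL, credentials, and target_table name are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath):

import java.util.Properties

// Placeholder connection details - adjust for your environment
val url = "jdbc:postgresql://localhost:5432/mydb"
val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")
props.setProperty("driver", "org.postgresql.Driver")

// The DataFrame from the last step carries the scalar columns plus the "json" string
// column, matching the 7-column target table described in the question
spark.read.schema(newSchema).json(combined.toJSON.rdd)
  .write.mode("append").jdbc(url, "target_table", props)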
