I have a Spark DataFrame with 10 columns that I need to store in Postgres/RDBMS. The target table has 7 columns, and the 7th column takes text (in JSON format) for further processing.
How do I select 6 columns and convert the remaining 4 columns in the DF to JSON format?
If the whole DF were to be stored as JSON, we could use DF.write.format("json"), but only the last 4 columns need to be in JSON format.
I tried creating a UDF (with either the Jackson or Lift library), but was not successful in passing the 4 columns to the UDF.
In the JSON, the DF column name is the key and the DF column's value is the value.
For example:
dataset name: ds_base
root
|-- bill_id: string (nullable = true)
|-- trans_id: integer (nullable = true)
|-- billing_id: decimal(3,-10) (nullable = true)
|-- asset_id: string (nullable = true)
|-- row_id: string (nullable = true)
|-- created: string (nullable = true)
|-- end_dt: string (nullable = true)
|-- start_dt: string (nullable = true)
|-- status_cd: string (nullable = true)
|-- update_start_dt: string (nullable = true)
I want to do something like:
ds_base
  .select($"bill_id",
          $"trans_id",
          $"billing_id",
          $"asset_id",
          $"row_id",
          $"created",
          ?? <JSON format of 4 remaining columns>
  )
You can use struct and to_json:
import org.apache.spark.sql.functions.{to_json, struct}
to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt"))
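To show how that expression fits into the full select, here is a minimal sketch. The helper name packAsJson, the object name, the local SparkSession settings and the sample row are all made up for illustration, and billing_id is simplified to a string; only struct, to_json and col come from Spark itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, to_json}

object ToJsonSketch {
  // Hypothetical helper: keeps `scalarColumns` as-is and packs `jsonColumns`
  // into one JSON string column named "json" (column name -> key, value -> value).
  def packAsJson(df: DataFrame,
                 scalarColumns: Seq[String],
                 jsonColumns: Seq[String]): DataFrame =
    df.select(
      scalarColumns.map(col) :+
        to_json(struct(jsonColumns.map(col): _*)).alias("json"): _*
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("to-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample row shaped like ds_base.
    val ds_base = Seq(
      ("b1", 1, "100", "a1", "r1", "2017-01-01",
       "2017-12-31", "2017-01-01", "A", "2017-02-01")
    ).toDF("bill_id", "trans_id", "billing_id", "asset_id", "row_id", "created",
           "end_dt", "start_dt", "status_cd", "update_start_dt")

    val result = packAsJson(
      ds_base,
      Seq("bill_id", "trans_id", "billing_id", "asset_id", "row_id", "created"),
      Seq("end_dt", "start_dt", "status_cd", "update_start_dt")
    )

    // Seven columns: six originals plus the "json" string column.
    result.printSchema()
    result.show(false)

    spark.stop()
  }
}
```

The resulting "json" column is a plain string, so the 7-column frame should be writable to a text column through the usual DataFrameWriter JDBC path, with a connection URL and table name of your own.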
As a workaround for legacy Spark versions, you could convert the whole object to JSON and extract the required fields:
import org.apache.spark.sql.functions.{col, get_json_object, struct}
// List of column names to be kept as-is
val scalarColumns: Seq[String] = Seq("bill_id", "trans_id", ...)
// List of column names to be put in JSON
val jsonColumns: Seq[String] = Seq(
  "end_dt", "start_dt", "status_cd", "update_start_dt"
)
// Convert all records to JSON, keeping selected fields as a nested document
val json = df.select(
  scalarColumns.map(col _) :+
    struct(jsonColumns map col: _*).alias("json"): _*
).toJSON
json.select(
  // Extract selected columns from JSON field, cast to required types, then name them
  scalarColumns.map(c =>
    get_json_object($"value", s"$$.$c").cast(df.schema(c).dataType).alias(c)) :+
    // Extract JSON struct
    get_json_object($"value", "$.json").alias("json"): _*
)
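This will work only as long as the scalar columns have atomic types, since get_json_object returns strings that then have to be cast back.
Alternatively, you could use the standard JSON reader and specify a schema in which the JSON field is a string: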
import org.apache.spark.sql.types._
val combined = df.select(
  scalarColumns.map(col _) :+
    struct(jsonColumns map col: _*).alias("json"): _*
)
val newSchema = StructType(combined.schema.fields map {
  case StructField("json", _, _, _) => StructField("json", StringType)
  case s => s
})
spark.read.schema(newSchema).json(combined.toJSON.rdd)