I have a file where each line is a stringified JSON object. I want to read it into a Spark DataFrame, along with schema validation:
val schema: StructType = getSchemaFromSomewhere()
val df: DataFrame = spark.read
  .option("mode", "DROPMALFORMED")
  .format("json")
  .schema(schema)
  .load("path/to/data.json")
However, this approach performs only some very basic schema validation. I want to apply a full JsonSchema validation - if a row does not comply with the schema, it should be dropped. To do that I can't use spark.read.json() anymore, because I need the data to be in JsonNode format. So instead I read the file as text and parse each line using the JsonSchema library:
import scala.util.Try
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.github.fge.jsonschema.main.{JsonSchema, JsonSchemaFactory}

def getJsonSchemaFactory: JsonSchemaFactory = JsonSchemaFactory.byDefault

def stringToJsonSchema(str: String): Try[JsonSchema] = {
  stringToJson(str).map(getJsonSchemaFactory.getJsonSchema(_))
}

def stringToJson(str: String): Try[JsonNode] = {
  val mapper = new ObjectMapper
  Try(mapper.readTree(str))
}
def validateJson(data: JsonNode): Boolean = {
  jsonSchema.exists { jsonSchema =>
    val report = jsonSchema.validateUnchecked(data, true)
    report.isSuccess
  }
}

lazy val jsonSchema: Option[JsonSchema] = stringToJsonSchema(schemaSource).toOption
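For reference, here is how these helpers behave outside of Spark. This is a minimal sketch: the inline schema and sample documents are made up for illustration, and `schemaSource` is assumed to hold the JsonSchema text:

```scala
// Hypothetical inline schema; in the real code schemaSource comes from elsewhere.
val schemaSource: String =
  """{
    |  "type": "object",
    |  "required": ["id", "name"],
    |  "properties": {
    |    "id":   { "type": "integer" },
    |    "name": { "type": "string" }
    |  }
    |}""".stripMargin

stringToJson("""{"id": 1, "name": "a"}""").map(validateJson) // valid document
stringToJson("""{"id": "oops"}""").map(validateJson)         // parses, but fails validation
stringToJson("""not json""").map(validateJson)               // Failure(...) - not parseable at all
```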
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val schema: StructType = getSchemaFromSomewhere()
val df = spark.read
  .textFile("path/to/data.json")
  .filter(str => stringToJson(str).map(validateJson).getOrElse(false))
  .select(from_json($"value", schema) as "jsonized")
  .select("jsonized.*")
The problem now is that I am parsing each line into JSON twice - once inside the filter, and again in the select(from_json...).
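One way to avoid the double parse (a sketch of my own, not guaranteed to fit every schema) is to skip from_json entirely: parse each line to a JsonNode once, validate it, and extract the target columns from that same node into a case class. The Person fields here are hypothetical stand-ins for whatever the real schema contains:

```scala
// Hypothetical target shape; replace with the fields of your actual schema.
case class Person(id: Long, name: String)

import spark.implicits._

val ds: Dataset[Person] = spark.read
  .textFile("path/to/data.json")
  .flatMap { line =>
    stringToJson(line).toOption          // parse exactly once
      .filter(validateJson)              // drop rows failing JsonSchema validation
      .map(node => Person(node.get("id").asLong, node.get("name").asText))
  }

val df: DataFrame = ds.toDF()
```

The trade-off is that the column extraction is hand-written instead of driven by a StructType, so schema changes mean code changes.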
What I'm looking for: some way to read JSON data from a file into a DataFrame while also applying JsonSchema validation to all the data - invalid rows should be dropped (and maybe also logged somewhere).

Is there a way to convert a Dataset[JsonNode] to a DataFrame without parsing it more than once? Alternatively, is there a way to convert a Row into a JsonNode object? That way I could flip the order - first read the DF using spark.read.json(), then filter the DF by converting each Row to a JsonNode and applying the JsonSchema.

Thanks
Is there a way to convert Dataset[JsonNode] to a DataFrame without parsing it more than once?
In most cases, the overhead of parsing twice is probably negligible compared to the total CPU usage of the job.

If that's not your case, you can implement your own TableProvider in DataSourceV2. This can be a decent long-term solution if the parsing requirements might change or evolve over time.
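For context, the DataSourceV2 entry point looks roughly like this (an outline only, assuming Spark 3.x; all method bodies are elided and the class name is made up):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Skeleton of a custom source that could parse + validate each line exactly once.
class ValidatedJsonSource extends TableProvider {
  // Schema of the rows this source produces (e.g. derived from the JsonSchema).
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

  // Return a Table whose ScanBuilder/PartitionReader parses each line to a
  // JsonNode, runs JsonSchema validation, and emits rows only for valid data.
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = ???
}
```

The Table you return then supplies a ScanBuilder and partition readers, which is where the single-pass parse-and-validate logic would live.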