Convert Dataset<String> containing JSON into Dataset<StructType>

Spark is great at parsing JSON into a nested StructType on the initial read from disk, but what if I already have a String column containing JSON in a Dataset, and I want to map it into a Dataset with a StructType column? I want schema inference that takes the whole dataset into account, runs fully in parallel, and avoids reducing actions.

I know of the functions schema_of_json and from_json, which are apparently intended to be used together to accomplish this or something similar, but I'm having trouble finding actual working code examples, especially in Java.

I'll accept any answer that supplies a Java example and satisfies the goals of full schema inference and fully parallel, non-reduced operation; or, if that's not possible, the closest workaround.

I'm currently using Spark 2.4.0.

I've studied the following related question:

Implicit schema discovery on a JSON-formatted Spark DataFrame column

This question is similar to mine, though for Scala. There is no accepted answer. The OP announces in a comment that they have found a "hacky" solution to getting schema_of_json to work. The problem with the solution, beyond the "hackiness", is that it infers the schema from only the first row of the dataframe, so the types are likely to be too tightly constrained:

val jsonSchema: String = df.select(schema_of_json(df.select(col("custom")).first.getString(0))).as[String].first
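
For reference, here's a rough Java equivalent of that Scala one-liner, since Java examples are what I'm after (my own sketch, assuming a Dataset<Row> named df with a JSON string column named custom, plus static imports from org.apache.spark.sql.functions):

    // Pull the JSON out of the first row as a plain Java String...
    String firstJson = df.select(col("custom")).first().getString(0);

    // ...then pass it to schema_of_json as a literal. Only this one row
    // informs the inferred schema, which is the weakness described above.
    String jsonSchema = df.select(schema_of_json(firstJson))
            .as(Encoders.STRING())
            .first();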

EDIT: I tried the solution indicated here as discussed in comments below. Here's the implementation:

    SparkSession spark = SparkSession
            .builder()
            .appName("example")
            .master("local[*]")
            .getOrCreate();

    Dataset<Row> df = spark.read().text(conf.getSourcePath());

    df.cache();

    String schema = df.select(schema_of_json(col("value")))
            .as(Encoders.STRING())
            .first();

    df.withColumn("parsedJson", from_json(col("value"), schema, new HashMap<String, String>()))
            .drop("value")
            .write()
            .mode("append")
            .parquet(conf.getDestinationPath());

From this code I obtained an error:

AnalysisException: cannot resolve 'schemaofjson(`value`)' due to data type mismatch: The input json should be a string literal and not null; however, got `value`.;;
'Project [schemaofjson(value#0) AS schemaOfjson(value)#20]
+- Relation[value#0] text

This error led me to the following Spark pull request: https://github.com/apache/spark/pull/22775

which seems to indicate that schema_of_json was never meant to be applied to a whole table for schema inference, but rather to infer a schema from a single, literal JSON sample passed in directly using lit("some json"). In that case, I don't know that Spark offers any solution for full schema inference from JSON across an entire table. Unless someone here can correct my reading of this pull request or offer an alternative approach?
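
If my reading is right, the only supported pattern is something like the following, where the sample is a hand-picked literal (again my own sketch, with a made-up JSON sample and the same df and value column as in my edit above):

    // Infer a schema from one literal JSON sample, then parse the whole
    // column with it. The schema reflects only that single sample.
    Column jsonSchema = schema_of_json(lit("{\"id\": 1, \"tags\": [\"a\", \"b\"]}"));

    Dataset<Row> parsed = df.withColumn("parsedJson",
            from_json(col("value"), jsonSchema));

This compiles and runs, but it clearly doesn't satisfy the whole-table inference goal.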

There's actually a very simple solution to this using DataFrameReader.json(Dataset<String>); I don't know why it hadn't come up in my searches:

    Dataset<String> ds = ...;

    spark.read()
        .json(ds)
        .write()
        .mode("append")
        .parquet(conf.getDestinationPath());

If you have multiple columns in the source Dataset, you can of course select just the one to operate on. Note that the element type has to be String (not Row, for example).
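
For example, a short sketch of carving out a single String column first (the column name value and the spark session are carried over from my edit above):

    // Select the JSON column and narrow Dataset<Row> down to the
    // Dataset<String> that DataFrameReader.json(Dataset<String>) requires.
    Dataset<String> ds = df.select(col("value")).as(Encoders.STRING());

    // Schema inference now considers the entire dataset, in parallel.
    spark.read().json(ds).printSchema();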
