
How can I set a logicalType in a spark-avro 2.4 schema?

We read timestamp information from Avro files in our application. I am in the process of testing an upgrade from Spark 2.3.1 to Spark 2.4, which includes the newly built-in spark-avro integration. However, I cannot figure out how to tell the Avro schema that I want timestamps to have the logicalType "timestamp-millis", as opposed to the default "timestamp-micros".

Looking at test Avro files written under Spark 2.3.1 with the Databricks spark-avro 4.0.0 package, we had the following fields in the schema:

{"name":"id","type":["string","null"]},
{"name":"searchQuery","type":["string","null"]},
{"name":"searchTime","type":["long","null"]},
{"name":"score","type":"double"},
{"name":"searchType","type":["string","null"]}

The searchTime in there was milliseconds since the epoch stored as a long. Everything was fine.

When I bumped things up to Spark 2.4 and the built-in spark-avro 2.4.0 package, I get this newer schema:

{"name":"id","type":["string","null"]},
{"name":"searchQuery","type":["string","null"]},
{"name":"searchTime","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},
{"name":"score","type":"double"},
{"name":"searchType","type":["string","null"]}

As one can see, the underlying type is still a long, but it is now augmented with a logicalType of "timestamp-micros". This is exactly what the release notes say will happen; however, I cannot find a way to specify the schema to use the "timestamp-millis" option.

This becomes a problem: when I write a Timestamp object initialized to, say, 10,000 seconds after the epoch to an Avro file, it gets read back out as 10,000,000 seconds. Under 2.3.1/databricks-avro, the value was simply a long with no unit information associated with it, so it came out just as it went in.
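To make the arithmetic concrete, here is a minimal sketch of the mismatch (the values are hypothetical and match the numbers above):

import java.sql.Timestamp

// Write side: a Timestamp 10,000 seconds after the epoch.
val written = new Timestamp(10000L * 1000)    // getTime == 10,000,000 ms

// spark-avro 2.4 stores the underlying long in MICROseconds ("timestamp-micros"):
val storedMicros = written.getTime * 1000     // 10,000,000,000 micros

// A reader that still treats the raw long as milliseconds is off by 1,000x:
val misread = new Timestamp(storedMicros)     // 10,000,000 seconds after the epoch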

We currently build a schema by reflecting over the object of interest as follows:

val searchSchema: StructType = ScalaReflection.schemaFor[searchEntry].dataType.asInstanceOf[StructType]
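Here searchEntry is the class being reflected over. A hypothetical sketch of its shape, with the imports the line above relies on; only the field names are known from the schema, the Scala types are assumptions, and searchTime is held as epoch milliseconds so that reflection yields LongType for it:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: field names follow the schema above; searchTime is
// epoch milliseconds, so ScalaReflection maps it to LongType.
case class searchEntry(
    id: String,
    searchQuery: String,
    searchTime: Long,
    score: Double,
    searchType: String)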

I tried augmenting this by creating a modified schema that attempted to replace the StructField corresponding to the searchTime entry as follows:

val modSearchSchema = StructType(searchSchema.fields.map {
  case StructField(name, _, nullable, metadata) if name == "searchTime" =>
    StructField(name, org.apache.spark.sql.types.DataTypes.TimestampType, nullable, metadata)
  case f => f
})

However, the StructField case class defined in spark.sql.types has no concept of a logicalType that could augment its dataType:

case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty) 

I have also tried to create a schema from a JSON representation in two ways:

val schemaJSONrepr = """{
          |          "name" : "id",
          |          "type" : "string",
          |          "nullable" : true,
          |          "metadata" : { }
          |        }, {
          |          "name" : "searchQuery",
          |          "type" : "string",
          |          "nullable" : true,
          |          "metadata" : { }
          |        }, {
          |          "name" : "searchTime",
          |          "type" : "long",
          |          "logicalType" : "timestamp-millis",
          |          "nullable" : false,
          |          "metadata" : { }
          |        }, {
          |          "name" : "score",
          |          "type" : "double",
          |          "nullable" : false,
          |          "metadata" : { }
          |        }, {
          |          "name" : "searchType",
          |          "type" : "string",
          |          "nullable" : true,
          |          "metadata" : { }
          |        }""".stripMargin

The first attempt was simply to create a DataType from that:

// here spark is a SparkSession instance from a higher scope.
val schema = DataType.fromJson(schemaJSONrepr).asInstanceOf[StructType]
spark.read
     .schema(schema)
     .format("avro")
     .option("basePath", baseUri)
     .load(uris: _*)

This failed in that it could not create a StructType for the searchTime node, because that node contains "logicalType". The second attempt was simply to create the schema by passing in the raw JSON string:

spark.read
     .schema(schemaJSONrepr)
     .format("avro")
     .option("basePath", baseUri)
     .load(uris: _*)

This fails with:

mismatched input '{' expecting {'SELECT', 'FROM', ...

== SQL ==

{
^^^
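For reference, both failures come down to input shape. DataType.fromJson parses Spark's own JSON schema representation, which is wrapped in a struct, uses "timestamp" as the type name, and has no logicalType key; the String overload of schema() expects a DDL-formatted string, not JSON. A sketch of both forms (spark, baseUri, and uris are from the surrounding scope):

import org.apache.spark.sql.types.{DataType, StructType}

// The JSON shape DataType.fromJson actually parses, showing just the
// searchTime field; note "timestamp" as the type name and no logicalType:
val sparkJson =
  """{"type":"struct","fields":[
    |  {"name":"searchTime","type":"timestamp","nullable":false,"metadata":{}}
    |]}""".stripMargin
val fromJsonSchema = DataType.fromJson(sparkJson).asInstanceOf[StructType]

// The String overload of schema() expects a DDL string:
spark.read
     .schema("id STRING, searchQuery STRING, searchTime TIMESTAMP, score DOUBLE, searchType STRING")
     .format("avro")
     .option("basePath", baseUri)
     .load(uris: _*)

Neither form can express the millis/micros distinction, though: Spark's TimestampType carries no unit, so the logicalType choice has to happen on the Avro side.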

I have found that the spark-avro API has a way to GET a logicalType from a schema, but I cannot figure out how to SET one.
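For what it is worth, here is a sketch of what setting one might look like, using Avro's own LogicalTypes API (plain Avro 1.8) together with spark-avro 2.4's avroSchema write option; df and outputUri are placeholders, and I have not verified this end to end:

import org.apache.avro.{LogicalTypes, Schema, SchemaBuilder}

// Attach the timestamp-millis logical type to a plain long (Avro 1.8 API):
val millisType: Schema =
  LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG))

// Hypothetical single-field record schema using that type:
val avroSchema: Schema = SchemaBuilder.record("searchEntry").fields()
  .name("searchTime").`type`(millisType).noDefault()
  .endRecord()

// spark-avro 2.4 accepts a user-supplied Avro schema via the avroSchema option:
df.write
  .format("avro")
  .option("avroSchema", avroSchema.toString)
  .save(outputUri)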

Beyond my failed attempts above, I also tried to use a Schema.Parser to create an Avro schema object, but the only types accepted by spark.read.schema are String and StructType.

If anyone can provide insight on how to change or specify this logicalType, I would very much appreciate it. Thanks.

Okay, I think I answered my own question. I had modified the programmatically built schema to use an explicit Timestamp type:

val modSearchSchema = StructType(searchSchema.fields.map {
  case StructField(name, _, nullable, metadata) if name == "searchTime" =>
    StructField(name, org.apache.spark.sql.types.DataTypes.TimestampType, nullable, metadata)
  case f => f
})

What I had missed was the read side: I had not altered the logic that pulls fields back out of the Row objects we read. Originally we read a Long and converted it to a Timestamp ourselves, which is where things went awry: the Long now held microseconds, making the value 1,000 times bigger than we intended. Changing the read to fetch a Timestamp object directly lets the underlying logic account for the units, taking it out of our (my) hands. So:

// searchTime = new Timestamp(row.getAs[Long]("searchTime"))  // BROKEN
searchTime = row.getAs[Timestamp]("searchTime")               // SUCCESS
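A quick round-trip sketch of the fixed behavior (the output path is a placeholder, and spark is the SparkSession from a higher scope):

import java.sql.Timestamp
import spark.implicits._

// Write a Timestamp 10,000 s after the epoch; spark-avro 2.4 encodes it as
// timestamp-micros and, read back as a Timestamp, reverses that encoding.
val ts = new Timestamp(10000L * 1000)
Seq(("some query", ts)).toDF("searchQuery", "searchTime")
  .write.format("avro").save("/tmp/search-entries")

val back = spark.read.format("avro").load("/tmp/search-entries")
assert(back.head.getAs[Timestamp]("searchTime") == ts)  // value intact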
