Avro backward compatibility does not work as expected

Question

I have two Avro schemas, V1 and V2, which are read in Spark as shown below:

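import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.avro.functions._
import spark.implicits._  // enables $"..." (assumes a SparkSession named spark)

// Avro schema (JSON text) that from_avro uses to decode the Kafka value
val jsonFormatSchema = new String(
  Files.readAllBytes(Paths.get("./examples/src/main/resources/V1.avsc")),
  StandardCharsets.UTF_8)

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Decode the binary value column into a struct column named avroFields
val output = df
  .select(from_avro($"value", jsonFormatSchema) as "avroFields")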

V1 has two fields, "one" and "two":

{
  "name": "test",
  "namespace": "foo.bar",
  "type": "record",
  "fields": [
    {
      "name": "one",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "two",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ]
}

V2 adds a new field, "three":

{
  "name": "test",
  "namespace": "foo.bar",
  "type": "record",
  "fields": [
    {
      "name": "one",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "two",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "three",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ]
}

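Scenario: the writer writes with V1, and the reader decodes the Avro records with V2. I expected field three to be populated with its default value, null. Instead, the Spark job fails with the exception below.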

Am I missing something here? My understanding is that Avro supports backward compatibility.

Exception in thread "main" java.io.EOFException
  at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
  at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
  at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:423)
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:116)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)

Answer 1

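You always have to decode Avro with the exact schema it was written with. Avro uses untagged data to be more compact, which requires the writer's schema to be present at decoding time.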

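So when you read V1 bytes with only the V2 schema, the decoder goes on to look for field three (more precisely, the union index that marks it as null), finds no bytes left, and raises the error. That matches the readIndex call near the top of the stack trace.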

What you can do is map the decoded data (decoded with the writer schema) onto the reader schema. Java has an API for this: SpecificDatumReader(Schema writer, Schema reader).
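The writer-plus-reader resolution described above can be sketched as follows. This is a minimal sketch, assuming the Avro Java library is on the classpath. It uses GenericDatumReader, the generic counterpart of SpecificDatumReader with the same two-schema constructor, so no generated classes are needed. The names ResolveV1WithV2, schemaJson, and roundTrip are made up for illustration, and the schemas are rebuilt inline from the V1 and V2 definitions in the question.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object ResolveV1WithV2 {
  // Hypothetical helper: builds the question's record schema for the given field names.
  def schemaJson(fieldNames: String*): String = {
    val fields = fieldNames
      .map(n => s"""{"name":"$n","type":["null","string"],"default":null}""")
      .mkString(",")
    s"""{"name":"test","namespace":"foo.bar","type":"record","fields":[$fields]}"""
  }

  // Writes one record with V1, then reads it back using V1 as writer schema
  // and V2 as reader schema.
  def roundTrip(): GenericRecord = {
    // Separate parsers: a single Parser refuses to define foo.bar.test twice.
    val v1 = new Schema.Parser().parse(schemaJson("one", "two"))
    val v2 = new Schema.Parser().parse(schemaJson("one", "two", "three"))

    // Writer side: encode a V1 record to raw (untagged) Avro binary.
    val record = new GenericData.Record(v1)
    record.put("one", "a")
    record.put("two", "b")
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](v1).write(record, encoder)
    encoder.flush()

    // Reader side: supply BOTH schemas so Avro can resolve V1 data into V2.
    val reader = new GenericDatumReader[GenericRecord](v1, v2)
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    reader.read(null, decoder)
  }

  def main(args: Array[String]): Unit = {
    val decoded = roundTrip()
    println(decoded)               // fields one, two, three
    println(decoded.get("three"))  // null, taken from the V2 default
  }
}
```

With both schemas supplied, field three is absent from the bytes and is filled in from its default. Constructing the reader with V2 alone reproduces the situation in the question, where the decoder runs past the end of the V1 bytes.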

Protocol Buffers and Thrift do what you want because they are tagged formats. Avro expects the schema to travel with the data, for example in an Avro file.
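For the Spark job in the question, the same idea may be expressible without dropping down to the Avro API. The sketch below assumes Spark 3.0 or later with the spark-avro module available. In those versions from_avro accepts an options map, and the documented avroSchema option can carry an evolved reader schema while the second argument stays the writer schema. It uses a small in-memory DataFrame instead of Kafka. FromAvroEvolved, schemaJson, and decode are illustrative names, not part of Spark.

```scala
import java.util.Collections
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

object FromAvroEvolved {
  // Hypothetical helper: builds the question's record schema for the given field names.
  def schemaJson(fieldNames: String*): String = {
    val fields = fieldNames
      .map(n => s"""{"name":"$n","type":["null","string"],"default":null}""")
      .mkString(",")
    s"""{"name":"test","namespace":"foo.bar","type":"record","fields":[$fields]}"""
  }

  def decode(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val v1 = schemaJson("one", "two")          // writer schema
    val v2 = schemaJson("one", "two", "three") // evolved reader schema

    // Stand-in for the Kafka value column: one record encoded with V1.
    val encoded = Seq(("a", "b")).toDF("one", "two")
      .select(to_avro(struct(col("one"), col("two")), v1) as "value")

    // Second argument: schema the bytes were written with.
    // avroSchema option: schema the result should conform to.
    encoded
      .select(from_avro(col("value"), v1,
        Collections.singletonMap("avroSchema", v2)) as "avroFields")
      .select("avroFields.*")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("avro-evolution-sketch")
      .getOrCreate()
    decode(spark).show(false)
    spark.stop()
  }
}
```

In the streaming job this would amount to loading both V1.avsc and V2.avsc and passing V1 as jsonFormatSchema with V2 under avroSchema. That only works while every message on the topic was written with V1. Topics that mix writer schemas need some way to ship the writer schema with each message.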

Solution 1
0 2021-11-20 22:18:14
