Using a different Avro schema for new columns

Question

I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro. The Avro schema (.avsc) has 80 fields as columns.

So I created an external table like this:

CREATE external TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')

Now I need to add 25 more columns to the Avro schema. In that case:

If I create a new table with the new schema, which has 105 columns, I will have two tables for one project. And if I add or remove columns again in the coming days, I will have to create yet another table each time. I am afraid of ending up with a lot of tables that use different schemas for the same project.

If I swap the old schema for the new schema in the current table, I will have only one table for the project, but I can no longer read the old data because of the schema conflict.

What is the best way to use an Avro schema in a case like that?

Answer 1

This is indeed challenging. The best way is to make sure every schema change you make is compatible with the old data: only remove columns that have defaults, and give defaults to the columns you add. That way you can safely swap the schemas without a conflict and keep reading old data. Avro is pretty clever about this. It's called "schema evolution" (in case you want to google a bit more), and it allows the reader and writer schemas to differ somewhat.
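The rule above (added columns need defaults, removed columns should have had defaults) can be checked before swapping the .avsc file. The following is only a minimal sketch using plain JSON handling, not the Avro library; the helper name `incompatible_changes` and the two tiny example schemas are made up for illustration, and it looks only at top-level record fields.

```python
import json

def incompatible_changes(old_schema: dict, new_schema: dict) -> list:
    """Hypothetical helper: list field changes that break the default-value rule."""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    new_fields = {f["name"]: f for f in new_schema["fields"]}
    problems = []
    # Added fields need a default so data written with the old schema stays readable.
    for name, field in new_fields.items():
        if name not in old_fields and "default" not in field:
            problems.append(f"added field '{name}' has no default")
    # Removed fields should have had a default so the old schema can still read new data.
    for name, field in old_fields.items():
        if name not in new_fields and "default" not in field:
            problems.append(f"removed field '{name}' had no default")
    return problems

# Illustrative schemas (assumed, not the real 80-column schema from the question).
old = json.loads('''{"type": "record", "name": "Fraud", "fields": [
  {"name": "id", "type": "string"},
  {"name": "legacy", "type": "string", "default": ""}]}''')
new = json.loads('''{"type": "record", "name": "Fraud", "fields": [
  {"name": "id", "type": "string"},
  {"name": "newColumn1", "type": "string", "default": ""},
  {"name": "newColumn2", "type": "string"}]}''')

print(incompatible_changes(old, new))
# prints ["added field 'newColumn2' has no default"]
```

An empty list suggests the swap follows the rule; it does not replace a real compatibility check by Avro or a schema registry.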

As an aside, Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically. You can use the registry to check whether the schemas are compatible, and if they are, simply write data using the new schema and the Hive table will automatically evolve to match.

Answer 2

I added new columns to the Avro schema like this:

{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},

With the default property, if a column doesn't exist in the existing data, the default value is returned; if the column does exist in the data, the stored value is returned as expected.

To set a null value as the default, you need this:
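The read behaviour described above can be imitated with plain dictionaries. This is only a sketch of the idea, not how the Avro library or Hive's AvroSerDe is implemented; `read_with_schema` and the sample rows are hypothetical.

```python
def read_with_schema(record: dict, reader_fields: list) -> dict:
    """Hypothetical stand-in for reading one row with a newer reader schema."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # The column exists in the stored data: return the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # The column is missing from old data: fall back to the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field '{name}' is missing and has no default")
    return resolved

reader_fields = [
    {"name": "id", "type": "string"},
    {"name": "newColumn1", "type": "string", "default": ""},
]

old_row = {"id": "a1"}                       # written before the column was added
new_row = {"id": "b2", "newColumn1": "x"}    # written with the new schema

print(read_with_schema(old_row, reader_fields))  # prints {'id': 'a1', 'newColumn1': ''}
print(read_with_schema(new_row, reader_fields))  # prints {'id': 'b2', 'newColumn1': 'x'}
```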

{ "name": "newColumn4", "type": [ "string", "null" ], "default": "null" },

or

{ "name": "newColumn5", "type": [ "null", "string" ]},

In the type property, null can be in first place, or in second place when used with the default property.
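One caveat worth checking against the Avro specification for your version: as far as I understand it, the default of a union field is interpreted against the first type listed in the union. Under that reading, `"default": "null"` with `"string"` first is the four-character string "null" rather than a real null, and a real null default needs `"null"` first together with `"default": null`. The sketch below only classifies the JSON; `describe_default` is a hypothetical helper, `newColumn6` is an added illustration, and only string/null unions are handled.

```python
import json

def describe_default(field_json: str) -> str:
    """Hypothetical helper: say what the default of a string/null field really is."""
    field = json.loads(field_json)
    if "default" not in field:
        return "no default"
    default = field["default"]
    ftype = field["type"]
    # For a union, compare the default against the first branch only (assumed rule).
    first = ftype[0] if isinstance(ftype, list) else ftype
    if default is None:
        return "null default" if first == "null" else "mismatch: null default needs null first"
    if isinstance(default, str):
        return f"string default {default!r}" if first == "string" else "mismatch: string default needs string first"
    return "unsupported default in this sketch"

print(describe_default('{ "name": "newColumn4", "type": [ "string", "null" ], "default": "null" }'))
# prints string default 'null'
print(describe_default('{ "name": "newColumn5", "type": [ "null", "string" ]}'))
# prints no default
print(describe_default('{ "name": "newColumn6", "type": [ "null", "string" ], "default": null }'))
# prints null default
```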

Answer 1
2 2016-11-01 14:16:12

Answer 2
1 Accepted 2016-11-02 06:29:05
