schema mismatch converting data between 2 schemas using aliases in fastavro

I'm trying to convert some data that matches schema old_schema to the field names used in new_schema using aliases.

I've been at it for too long and can't see what is wrong with this code:

from fastavro import writer, reader, json_writer
from fastavro.schema import parse_schema
from io import BytesIO

# Sample data
input_json = [
    {
        "key1": "value1",
        "key2": "value2",
        "key3": "value3"
    }
]

# Old schema that matches the input_json
old_schema = parse_schema({
    "type": "record",
    "namespace": "com.node40",
    "name": "generated",
    "fields": [
        {
            "name": "key1",
            "type": "string"
        },
        {
            "name": "key2",
            "type": "string"
        },
        {
            "name": "key3",
            "type": "string"
        }
    ]
})

# New schema with old schema names as aliases
new_schema = parse_schema({
    "type": "record",
    "namespace": "com.node40",
    "name": "test",
    "fields": [
        {
            "name": "k1",
            "type": "string",
            "aliases": ["key1"]
        },
        {
            "name": "k2",
            "type": "string",
            "aliases": ["key2"]
        },
        {
            "name": "k3",
            "type": "string",
            "aliases": ["key3"]
        }
    ]
})
records = [
    {
        "key1": "value1",
        "key2": "value2",
        "key3": "value3"
    }
]

# Write to buffer as serialized avro using old_schema
buffer = BytesIO()
writer(buffer, old_schema, input_json, validator=True)
buffer.seek(0)

# Read serialized avro from buffer, deserialize and write to json file
input_avro = reader(buffer, new_schema)
json_writer('fitted_data.json', new_schema, input_avro)

This results in a SchemaResolutionError from fastavro. It's such a simple example, but I just can't see what is wrong with it. Help appreciated!

The main problem is that your old schema is named generated, with a namespace of com.node40. The new schema has the same namespace but is named test. The Avro schema resolution rules state that two record schemas match only if both are records with the same (unqualified) name.

So you can either rename the new schema to match the old one, or use aliases again, this time at the record level of the new schema:

new_schema = {
    "type": "record",
    "namespace": "com.node40",
    "name": "test",
    "aliases": ["com.node40.generated"],
    ...
}

Note: Technically you should only have to write "aliases": ["generated"], but there appears to be a bug in fastavro where that case is not handled correctly. Putting the fully namespaced name in the alias list works.
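To make the name-matching rule concrete, here is a minimal pure-Python sketch of it. The helper names (unqualified, fullname, record_names_match) are made up for illustration, and this is not fastavro's actual implementation. It only mirrors the rule described above: writer and reader records match if their unqualified names are equal, or if the writer's full name appears among the reader's aliases, with an unqualified alias taken relative to the reader's namespace.

```python
def unqualified(name):
    """Return the part of an Avro name after the last dot."""
    return name.rsplit(".", 1)[-1]


def fullname(name, namespace):
    """Qualify a name with a namespace unless it is already dotted."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def record_names_match(writer_schema, reader_schema):
    """Hypothetical helper: do two record schemas match by name or alias?"""
    # Rule 1: same unqualified record name.
    if unqualified(writer_schema["name"]) == unqualified(reader_schema["name"]):
        return True
    # Rule 2: the writer's full name is listed as an alias on the reader.
    writer_full = fullname(writer_schema["name"], writer_schema.get("namespace"))
    reader_ns = reader_schema.get("namespace")
    aliases = {fullname(a, reader_ns) for a in reader_schema.get("aliases", [])}
    return writer_full in aliases


old = {"type": "record", "namespace": "com.node40", "name": "generated"}
new = {"type": "record", "namespace": "com.node40", "name": "test"}

print(record_names_match(old, new))
print(record_names_match(old, {**new, "aliases": ["generated"]}))
print(record_names_match(old, {**new, "aliases": ["com.node40.generated"]}))
```

Without a record-level alias the two schemas in the question do not match, which is why resolution fails before the field aliases are ever considered.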

After you do all that, your example will still fail, because the last line, json_writer('fitted_data.json', new_schema, input_avro), passes a filename where json_writer expects an open file-like object. Change it to:

with open('fitted_data.json', 'w') as fo:
    json_writer(fo, new_schema, input_avro)
