How to use AVRO format on AWS Glue/Athena

I have a few Kafka topics that write Avro files into S3 buckets, and I would like to query those buckets with AWS Athena.

I'm trying to create a table, but the AWS Glue crawler runs without adding it (it works if I change the file type to JSON). I've also tried creating the table from the Athena console, but it doesn't list Avro as a supported file format.

Any idea how to make this work?

I suggest doing it manually and not via Glue. Glue only works for the most basic situations, and unfortunately this falls outside of that.

You can find the documentation on how to create an Avro table here: https://docs.aws.amazon.com/athena/latest/ug/avro.html
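
"Manually" can mean pasting the DDL into the Athena console, or submitting it programmatically. Below is a minimal sketch of the latter, assuming boto3 is installed and AWS credentials are configured. The helper name `submit_ddl`, the database name and the results location are hypothetical placeholders, not something from the question.

```python
def submit_ddl(athena_client, ddl: str, database: str, output_location: str) -> str:
    """Submit a DDL statement to Athena and return the query execution id.

    `athena_client` is expected to be a boto3 Athena client, i.e. the result
    of boto3.client("athena"). It is passed in as an argument so the function
    can be exercised without AWS access.
    """
    response = athena_client.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": database},
        # Athena writes query output (even for DDL) under this S3 prefix
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical usage (requires boto3 and real AWS resources):
#   import boto3
#   client = boto3.client("athena")
#   submit_ddl(client, create_table_statement, "default",
#              "s3://some-bucket/athena-results/")
```

The call returns as soon as the query is queued, so the returned id would still need to be checked for completion before the table is used.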

The caveat for Avro tables is that you need to specify both the table columns and the Avro schema. This may look weird and redundant, but it's how Athena/Presto works. It needs a schema to know how to interpret the files, and then it needs to know which of the properties in the files you want to expose as columns (and their types, which may or may not match the Avro types).

CREATE EXTERNAL TABLE avro_table (
   foo STRING,
   bar INT
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "example",
  "namespace": "default",
  "fields": [
    {
      "name": "foo",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "bar",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
')
STORED AS AVRO
LOCATION 's3://some-bucket/data/';

Notice how the Avro schema appears as a JSON document inside a single-quoted serde property value. The formatting is optional, but it makes this example easier to read.
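
If you would rather not hand-edit the schema literal, the statement can be rendered from a schema held as a Python dict. This is only a sketch: `build_avro_table_ddl` is a hypothetical helper, and the table name, columns and location simply mirror the example above. It refuses schemas that contain a single quote instead of guessing at the escaping rules.

```python
import json


def build_avro_table_ddl(table: str, columns: dict, avro_schema: dict, location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for an Avro-backed table."""
    column_list = ",\n".join(f"  `{name}` {typ}" for name, typ in columns.items())
    # Compact JSON is enough: whitespace inside the schema literal is optional
    schema_literal = json.dumps(avro_schema, separators=(",", ":"))
    if "'" in schema_literal:
        # The literal sits inside a single-quoted SQL string; bail out
        # rather than assume a particular escaping convention
        raise ValueError("schema contains a single quote; escape it by hand")
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{column_list}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'\n"
        f"WITH SERDEPROPERTIES ('avro.schema.literal' = '{schema_literal}')\n"
        "STORED AS AVRO\n"
        f"LOCATION '{location}';"
    )


example_schema = {
    "type": "record",
    "name": "example",
    "namespace": "default",
    "fields": [
        {"name": "foo", "type": ["null", "string"], "default": None},
        {"name": "bar", "type": ["null", "int"], "default": None},
    ],
}

ddl = build_avro_table_ddl(
    "avro_table",
    {"foo": "STRING", "bar": "INT"},
    example_schema,
    "s3://some-bucket/data/",
)
print(ddl)
```

Keeping the columns and the schema as separate arguments reflects the point made above: Athena needs both, and they do not have to match one to one.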

Doing it manually seems to be the way to make it work. Here is some code that generates the Athena schema directly from a literal Avro schema. It works with avro-python3 on Python 3.7. It is taken from https://github.com/dataqube-GmbH/avro2athena (I am the owner of the repo).

from avro.schema import Parse, RecordSchema, PrimitiveSchema, ArraySchema, MapSchema, EnumSchema, UnionSchema, FixedSchema


def create_athena_schema_from_avro(avro_schema_literal: str) -> str:
    avro_schema: RecordSchema = Parse(avro_schema_literal)

    column_schemas = []
    for field in avro_schema.fields:
        column_name = field.name.lower()
        column_type = create_athena_column_schema(field.type)
        column_schemas.append(f"`{column_name}` {column_type}")

    return ', '.join(column_schemas)


def create_athena_column_schema(avro_schema) -> str:
    if type(avro_schema) == PrimitiveSchema:
        return rename_type_names(avro_schema.type)

    elif type(avro_schema) == ArraySchema:
        items_type = create_athena_column_schema(avro_schema.items)
        return f'array<{items_type}>'

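    elif type(avro_schema) == MapSchema:
        # recurse so that nested and renamed value types (e.g. long -> bigint) are handled
        values_type = create_athena_column_schema(avro_schema.values)
        return f'map<string,{values_type}>'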

    elif type(avro_schema) == RecordSchema:
        field_schemas = []
        for field in avro_schema.fields:
            field_name = field.name.lower()
            field_type = create_athena_column_schema(field.type)
            field_schemas.append(f'{field_name}:{field_type}')

        field_schema_concatenated = ','.join(field_schemas)
        return f'struct<{field_schema_concatenated}>'

    elif type(avro_schema) == UnionSchema:
        # pick the first schema which is not null
        union_schemas_not_null = [s for s in avro_schema.schemas if s.type != 'null']
        if len(union_schemas_not_null) > 0:
            return create_athena_column_schema(union_schemas_not_null[0])
        else:
            raise Exception('union schemas contains only null schema')

    elif type(avro_schema) in [EnumSchema, FixedSchema]:
        return 'string'

    else:
        raise Exception(f'unknown avro schema type {avro_schema.type}')


def rename_type_names(typ: str) -> str:
    # Avro primitive names that differ from their Athena/Hive counterparts
    renames = {'long': 'bigint', 'bytes': 'binary'}
    return renames.get(typ, typ)
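
For cases where the avro library is not available, roughly the same mapping can be sketched on top of the standard `json` module alone. This is not the code from the repo: `athena_type` and `athena_columns` are hypothetical names, the `bytes` to `binary` mapping is an assumption based on how Hive's AvroSerDe treats that type, and references to named types and logical types are not handled.

```python
import json

# Avro primitive -> Athena type (the bytes -> binary entry is an assumption)
PRIMITIVE_TYPES = {
    "string": "string", "int": "int", "long": "bigint", "float": "float",
    "double": "double", "boolean": "boolean", "bytes": "binary",
}


def athena_type(avro_type) -> str:
    """Translate a JSON-decoded Avro type into an Athena type string."""
    if isinstance(avro_type, str):
        if avro_type not in PRIMITIVE_TYPES:
            raise ValueError(f"unsupported or named Avro type: {avro_type}")
        return PRIMITIVE_TYPES[avro_type]
    if isinstance(avro_type, list):
        # Union: use the first branch that is not null
        non_null = [t for t in avro_type if t != "null"]
        if not non_null:
            raise ValueError("union contains only null")
        return athena_type(non_null[0])
    kind = avro_type["type"]
    if kind == "array":
        return f"array<{athena_type(avro_type['items'])}>"
    if kind == "map":
        return f"map<string,{athena_type(avro_type['values'])}>"
    if kind == "record":
        fields = ",".join(
            f"{f['name'].lower()}:{athena_type(f['type'])}" for f in avro_type["fields"]
        )
        return f"struct<{fields}>"
    if kind in ("enum", "fixed"):
        return "string"
    # Something like {"type": "string"}: unwrap and translate the inner type
    return athena_type(kind)


def athena_columns(avro_schema_literal: str) -> str:
    """Build the column list of a CREATE TABLE statement from a record schema."""
    schema = json.loads(avro_schema_literal)
    return ", ".join(
        f"`{f['name'].lower()}` {athena_type(f['type'])}" for f in schema["fields"]
    )


schema_literal = """
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "foo", "type": ["null", "string"], "default": null},
    {"name": "bar", "type": ["null", "int"], "default": null},
    {"name": "counts", "type": {"type": "map", "values": "long"}}
  ]
}
"""
print(athena_columns(schema_literal))
# prints: `foo` string, `bar` int, `counts` map<string,bigint>
```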
