
How to create a table from an Avro schema (.avsc)?

I have an Avro schema file and I need to create a table in Databricks through PySpark. I don't need to load the data, just create the table. The easy way is to load the JSON string and take the "name" and "type" of each entry in the fields array, then generate the CREATE SQL query; a sketch of that manual approach is shown after the sample schema below. I want to know whether there is a programmatic way to do that with any API. Sample schema -

{
  "type" : "record",
  "name" : "kylosample",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "registration_dttm",
    "type" : "string",
    "doc" : "Type inferred from '2016-02-03T07:55:29Z'"
  }, {
    "name" : "id",
    "type" : "long",
    "doc" : "Type inferred from '1'"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "Type inferred from 'Amanda'"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "Type inferred from 'Jordan'"
  }, {
    "name" : "email",
    "type" : "string",
    "doc" : "Type inferred from 'ajordan0@com.com'"
  }, {
    "name" : "gender",
    "type" : "string",
    "doc" : "Type inferred from 'Female'"
  }, {
    "name" : "ip_address",
    "type" : "string",
    "doc" : "Type inferred from '1.197.201.2'"
  }, {
    "name" : "cc",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '6759521864920116'",
    "default" : null
  }, {
    "name" : "country",
    "type" : "string",
    "doc" : "Type inferred from 'Indonesia'"
  }, {
    "name" : "birthdate",
    "type" : "string",
    "doc" : "Type inferred from '3/8/1971'"
  }, {
    "name" : "salary",
    "type" : [ "null", "double" ],
    "doc" : "Type inferred from '49756.53'",
    "default" : null
  }, {
    "name" : "title",
    "type" : "string",
    "doc" : "Type inferred from 'Internal Auditor'"
  }, {
    "name" : "comments",
    "type" : "string",
    "doc" : "Type inferred from '1E+02'"
  } ]
}
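For reference, a minimal PySpark sketch of that manual approach (the file path, table name, and the simplified Avro-to-SQL type mapping are all assumptions):

import json

# Simplified Avro-to-SQL type mapping; it only covers the types that appear
# in the sample schema above, and unions with "null" are treated as plain
# nullable columns.
AVRO_TO_SQL = {
    "string": "STRING",
    "long": "BIGINT",
    "double": "DOUBLE",
    "int": "INT",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
}

def avro_type_to_sql(avro_type):
    if isinstance(avro_type, list):  # e.g. ["null", "long"]
        avro_type = next(t for t in avro_type if t != "null")
    return AVRO_TO_SQL[avro_type]

# Placeholder path and table name
with open("/dbfs/path/to/schema.avsc") as f:
    avsc = json.load(f)

columns = ", ".join(
    f"{field['name']} {avro_type_to_sql(field['type'])}"
    for field in avsc["fields"]
)
spark.sql(f"CREATE TABLE db.table_name ({columns})")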

This does not appear to be available via the Python API yet. This is how I have done it in the past: create an external table via Spark SQL that points to your exported .avsc, since you only want to create a table and not load any data. Example:

spark.sql("""
CREATE EXTERNAL TABLE db.table_name
STORED AS AVRO
LOCATION 'PATH/WHERE/DATA/WILL/BE/STORED'
TBLPROPERTIES('avro.schema.url'='PATH/TO/SCHEMA.avsc')
""")

The native Scala API in Spark 2.4 looks to have an .avsc reader available now. Since you are using Databricks, you can switch the language of a notebook cell with %scala, %python, or %sql. Scala example:

import java.io.File
import org.apache.avro.Schema

// Parse the .avsc file into an Avro Schema object
val schema = new Schema.Parser().parse(new File("user.avsc"))

// Read Avro data with the externally supplied schema and preview it
spark
  .read
  .format("avro")
  .option("avroSchema", schema.toString)
  .load("/tmp/episodes.avro")
  .show()
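A rough PySpark equivalent of that read: the .avsc contents can simply be read as text and passed through the avroSchema option (paths are placeholders):

# Read the .avsc file as plain text and hand it to the Avro reader
with open("/dbfs/path/to/user.avsc") as f:
    avro_schema = f.read()

df = (spark.read
      .format("avro")
      .option("avroSchema", avro_schema)
      .load("/tmp/episodes.avro"))
df.show()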

Reference docs for the Spark 2.4 Avro integration:

https://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration

https://databricks.com/blog/2018/11/30/apache-avro-as-a-built-in-data-source-in-apache-spark-2-4.html
