
How to convert Avro Schema object into StructType in spark

I have an RDD of type Row, i.e. RDD[Row], and an Avro schema object. I need to create a DataFrame with this info.

I need to convert the Avro schema object into a StructType for creating the DataFrame.

Can you please help?

com.databricks.spark.avro has a class to help you with this:

 StructType requiredType = (StructType) SchemaConverters.toSqlType(AvroClass.getClassSchema()).dataType();

Please go through this specific example: http://bytepadding.com/big-data/spark/read-write-parquet-files-using-spark/

Updated as of 2020-05-31

Use the below if you're on Scala 2.12 with a newer Spark version.

sbt:

scalaVersion := "2.12.11"
val sparkVersion = "2.4.5"
libraryDependencies += "org.apache.spark" %% "spark-avro" % sparkVersion

import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

val schemaType = SchemaConverters
  .toSqlType(avroSchema)
  .dataType
  .asInstanceOf[StructType]

Databricks provides Avro-related utilities in the spark-avro package; use the below dependency in sbt: "com.databricks" % "spark-avro_2.11" % "3.2.0"

Code


val sqlSchema = SchemaConverters.toSqlType(avroSchema)


Before the '3.2.0' version, 'toSqlType' was a private method, so if you are using a version older than 3.2, copy the complete method into your own util class; otherwise, upgrade to the latest version.
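For those older versions, the core of such a util method can be sketched in plain Python for primitive record fields. This is a simplified illustration of the mapping, not the actual `SchemaConverters` source; the `avro_record_to_struct_json` helper and its type table are assumptions made for this sketch. It emits the JSON form that `pyspark.sql.types.StructType.fromJson` accepts:

```python
import json

# Simplified sketch of the Avro-record -> Spark StructType field mapping
# that SchemaConverters.toSqlType performs, covering primitive fields and
# ["null", T] unions only. The helper name and type table are illustrative
# assumptions, not the real implementation.
AVRO_TO_SQL = {
    "string": "string", "int": "integer", "long": "long",
    "float": "float", "double": "double", "boolean": "boolean",
    "bytes": "binary",
}

def avro_record_to_struct_json(avro_schema_str):
    avro = json.loads(avro_schema_str)
    fields = []
    for f in avro["fields"]:
        t = f["type"]
        nullable = False
        if isinstance(t, list):  # a union such as ["null", "string"]
            nullable = "null" in t
            t = next(x for x in t if x != "null")
        fields.append({"name": f["name"], "type": AVRO_TO_SQL[t],
                       "nullable": nullable, "metadata": {}})
    return {"type": "struct", "fields": fields}

avro_schema = '''{"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"]}]}'''

struct_json = avro_record_to_struct_json(avro_schema)
print(json.dumps(struct_json))
```

The printed dict can then be fed to `StructType.fromJson` to obtain an actual StructType; the real converter additionally handles nested records, arrays, maps, enums, and logical types.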

Any example for doing the same in pyspark? The below code works for me, but there should be some other, easier way to do this.

# pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4

import requests
import os
import avro.schema

from pyspark.sql.types import StructType

schema_registry_url = 'https://schema-registry.net/subjects/subject_name/versions/latest/schema'
schema_requests = requests.get(url=schema_registry_url)

spark_type = sc._jvm.org.apache.spark.sql.avro.SchemaConverters.toSqlType(sc._jvm.org.apache.avro.Schema.Parser().parse(schema_requests.text))

# spark_type is a py4j handle to a JVM SchemaType, not a Python StructType.
# One way to cross the boundary is to round-trip through JSON, since the
# JVM DataType's json() output matches what StructType.fromJson expects:
import json
struct_type = StructType.fromJson(json.loads(spark_type.dataType().json()))

In pyspark 2.4.7, my solution is to create an empty dataframe with the Avro schema, and then take the StructType object from this empty dataframe.

with open('/path/to/some.avsc','r') as avro_file:
    avro_scheme = avro_file.read()

df = spark\
    .read\
    .format("avro")\
    .option("avroSchema", avro_scheme)\
    .load()

struct_type = df.schema
