Reading avro data with Databricks from Azure Data Lake Gen1 generated by Azure EventHubs Capture fails
I am trying to read Avro data from Azure Data Lake Gen1, generated by Azure Event Hubs with Event Hubs Capture enabled, in Azure Databricks with pyspark:
inputdata = "evenhubscapturepath/*/*"
rawData = spark.read.format("avro").load(inputdata)
The following statement fails
rawData.count()
with
org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 48.0 failed 4 times, most recent failure: Lost task 162.3 in stage 48.0 (TID 2807, 10.3.2.4, executor 1): java.io.IOException: Not an Avro data file
Is Event Hubs Capture writing non-Avro data? Are there any best practices for reading Event Hubs captured data with Spark?
One pattern implementing a cold ingestion path is using Event Hubs Capture. Event Hubs Capture writes one file per partition as defined by the windowing parameters. The data is written in Avro format and can be analyzed with Apache Spark.

So what are the best practices for using this functionality?
1. Do not over-partition
Often I have seen people using the default configuration, which frequently results in many small files. If you want to consume the data ingested via Event Hubs Capture with Spark, keep in mind the best practices for file sizes in Azure Data Lake Store and for partitions with Spark. File sizes should be ~256 MB and partitions between 10 and 50 GB. So finally the configuration depends on the number and sizes of the messages you are consuming. In most cases you are doing fine with just partitioning your data per ingest date.
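As a rough illustration of the ~256 MB target, the number of output files for a repartition step can be derived from the total data size. This is a minimal sketch with hypothetical numbers, not tied to any specific capture setup:

```python
# Estimate a repartition count from the total data size,
# targeting ~256 MB per output file (hypothetical numbers).
TARGET_FILE_MB = 256

def target_partitions(total_size_mb: int) -> int:
    # ceiling division, with at least one partition
    return max(1, -(-total_size_mb // TARGET_FILE_MB))

print(target_partitions(1280))  # 1280 MB / 256 MB -> 5 files
```

A value computed this way could then be passed to `df.repartition(...)` before writing.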
2. Check the "Do not emit empty files" option
You should check the "Do not emit empty files" option. If you want to consume the data with Spark, that saves unnecessary file operations.
3. Use the data origin in your file paths
With a streaming architecture your Event Hub is what a landing zone would be in a batch-oriented architecture. So you will ingest the data into a raw data layer. Good practice is to use the data source instead of the name of the Event Hub in the directory path. So, for example, if you are ingesting telemetry data from robots in your factory, this could be the directory path /raw/robots/
The storage naming requires all attributes like {Namespace} and {PartitionId} to be used. So finally a good capture file format definition with an explicitly defined path, a daily partition, and use of the remaining attributes for the filename in Azure Data Lake Gen 2 could look like this:
/raw/robots/ingest_date={Year}-{Month}-{Day}/{Hour}{Minute}{Second}-{Namespace}-{EventHub}-{PartitionId}
4. Think of a compaction job
Captured data is not compressed and might also end up in too-small files in your use case (as the minimum write frequency is 15 minutes). So if necessary write a compaction job running once a day. Something like
df.repartition(5).write.format("avro").save(targetpath)
will do this job.
So what are now the best practices for reading the captured data?
5. Ignore non-Avro files when reading the data
Azure Event Hubs Capture writes temporary data to Azure Data Lake Gen1. Best practice is to only read files with the Avro extension. You can easily achieve this via a Spark configuration:
spark.conf.set("avro.mapred.ignore.inputs.without.extension", "true")
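If you prefer not to rely on that flag, another option is to filter the file listing yourself before handing it to the reader. This is a sketch; `avro_paths` and the sample paths are made up for illustration:

```python
def avro_paths(paths):
    # keep only files with the .avro extension; Capture's temporary
    # files (written without the extension) are skipped
    return [p for p in paths if p.endswith(".avro")]

files = ["/raw/robots/a.avro", "/raw/robots/_tmp", "/raw/robots/b.avro"]
print(avro_paths(files))  # ['/raw/robots/a.avro', '/raw/robots/b.avro']
# then e.g.: spark.read.format("avro").load(avro_paths(files))
```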
6. Read only relevant partitions
Consider reading only relevant partitions, e.g. filter on the current ingestion day.
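With the daily ingest_date partition from above, the path for a single day can be built as a simple string. A sketch, where `ingest_path` is a hypothetical helper:

```python
from datetime import date

def ingest_path(base: str, day: date) -> str:
    # build the glob for one ingest day, matching the
    # ingest_date={Year}-{Month}-{Day} partition scheme above
    return f"{base}/ingest_date={day:%Y-%m-%d}/*"

path = ingest_path("/raw/robots", date(2019, 5, 1))
print(path)  # /raw/robots/ingest_date=2019-05-01/*
# then: spark.read.format("avro").load(path)
```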
7. Use shared metadata
Reading the captured data works similarly to reading the data directly from Azure Event Hubs. So you have to have a schema. Assuming that you also have jobs reading the data directly with Spark Structured Streaming, a good pattern is to store the metadata and share it. You could just store this metadata in a Data Lake Store JSON file:
[{"MeasurementTS":"timestamp","Location":"string", "Temperature":"double"}]
and read it with this simple parsing function:
# parse the metadata to get the schema
from collections import OrderedDict
from pyspark.sql.types import *
import json

ds = dbutils.fs.head(metadata)  # read the metadata file
items = (json
         .JSONDecoder(object_pairs_hook=OrderedDict)
         .decode(ds)[0].items())

# map the JSON type names to Spark SQL types
mapping = {"string": StringType, "integer": IntegerType, "double": DoubleType,
           "timestamp": TimestampType, "boolean": BooleanType}
schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])
So you could just reuse your schema:
from pyspark.sql.functions import *
parsedData = (spark.read.format("avro").load(rawpath)
              .selectExpr("EnqueuedTimeUtc", "cast(Body as string) as json")
              .select("EnqueuedTimeUtc", from_json("json", schema=schema).alias("data"))
              .select("EnqueuedTimeUtc", "data.*"))
Make sure the input data is an ".avro" file. Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter. To load/save data in Avro format, you need to specify the data source format option as avro (or org.apache.spark.sql.avro).
Example (Python):
df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
Or
#storage->avro
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
For more details, refer to the links below:
https://spark.apache.org/docs/latest/sql-data-sources-avro.html
http://blog.itaysk.com/2017/01/14/processing-event-hub-capture-files-using-spark
https://medium.com/@caiomsouza/processing-event-hubs-capture-files-avro-format-using-spark-azure-databricks-save-to-parquet-95259001d85f
Hope this helps.