
Querying avro data files stored in Azure Data Lake directly with raw SQL from Databricks

I'm using Databricks Notebooks to read avro files stored in an Azure Data Lake Gen2. The avro files are created by an Event Hub Capture, and present a specific schema. From these files I have to extract only the Body field, where the data which I'm interested in is actually stored.
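
For reference, the schema of these capture files looks roughly like the sketch below (the usual Event Hub Capture layout; exact field names and types should be verified against your own files), which is why only the binary Body column needs to be decoded:

path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
spark.read.format('avro').load(path).printSchema()
# Typical Event Hub Capture layout (assumed; verify against your files):
# root
#  |-- SequenceNumber: long
#  |-- Offset: string
#  |-- EnqueuedTimeUtc: string
#  |-- SystemProperties: map
#  |-- Properties: map
#  |-- Body: binary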

I already implemented this in Python and it works as expected:

path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)  # 1: read the avro capture file
df1 = df0.select(df0.Body.cast('string'))   # 2: cast the binary Body column to string
rdd1 = df1.rdd.map(lambda x: x[0])          # 3: turn it into an RDD of JSON strings
data = spark.read.json(rdd1)                # 4: parse the JSON strings into a dataframe

Now I need to translate this to raw SQL in order to filter the data directly in the SQL query. Considering the 4 steps above, steps 1 and 2 with SQL are as follows:

CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")

WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro)

SELECT * FROM body_array

With this partial query I get the same as df1 above (step 2 with Python):

Body
[{"id":"a123","group":"0","value":1.0,"timestamp":"2020-01-01T00:00:00.0000000"},
{"id":"a123","group":"0","value":1.5,"timestamp":"2020-01-01T00:01:00.0000000"},
{"id":"a123","group":"0","value":2.3,"timestamp":"2020-01-01T00:02:00.0000000"},
{"id":"a123","group":"0","value":1.8,"timestamp":"2020-01-01T00:03:00.0000000"}]
[{"id":"b123","group":"0","value":2.0,"timestamp":"2020-01-01T00:00:01.0000000"},
{"id":"b123","group":"0","value":1.2,"timestamp":"2020-01-01T00:01:01.0000000"},
{"id":"b123","group":"0","value":2.1,"timestamp":"2020-01-01T00:02:01.0000000"},
{"id":"b123","group":"0","value":1.7,"timestamp":"2020-01-01T00:03:01.0000000"}]
...

I need to know how to introduce steps 3 and 4 into the SQL query, to parse the strings into json objects and finally get the desired dataframe with columns id, group, value and timestamp. Thanks.

One way I found to do this with raw SQL is as follows, using the from_json Spark SQL built-in function and the schema of the Body field:

CREATE TEMPORARY VIEW file_avro
USING avro
OPTIONS (path "abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro")

WITH body_array AS (SELECT cast(Body AS STRING) FROM file_avro),
data1 AS (SELECT from_json(Body, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') FROM body_array),
data2 AS (SELECT explode(*) FROM data1),
data3 AS (SELECT col.* FROM data2)
SELECT * FROM data3 WHERE id = "a123"     --FILTERING BY CHANNEL ID

It performs faster than the Python code I posted in the question, surely because of the use of from_json and the schema of Body to extract the data inside it. My version of this approach in PySpark looks as follows:

path = 'abfss://file_system@storage_account.dfs.core.windows.net/root/YYYY/MM/DD/HH/mm/file.avro'
df0 = spark.read.format('avro').load(path)
df1 = df0.selectExpr("cast(Body as string) as json_data")
df2 = df1.selectExpr("from_json(json_data, 'array<struct<id:string,group:string,value:double,timestamp:timestamp>>') as parsed_json")
data = df2.selectExpr("explode(parsed_json) as json").select("json.*")
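
From there the result can be filtered just like in the SQL version above; a minimal usage sketch (the value 'a123' is only an example id):

data.filter("id = 'a123'").show(truncate=False)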
