Convert any multiply nested JSON structure into KEY and VALUE fields
I was asked to build an ETL pipeline in Azure. This pipeline should
The problem is that different types of records use different JSON structures. I do not want to write a custom expression for each class of JSON structure (there would be hundreds of them). Rather, I'm looking for a generic mechanism that can parse them regardless of the type of the input JSON structure.
At the moment, I fulfill this requirement with the built-in ADF connector for ORC. The process in its current design:
Use the following T-SQL statement, as part of a stored procedure executed after step 1, to parse the content of the PARAMS field:
SELECT uuid,
       AttrName = a1.[key]
                + COALESCE('.' + a2.[key], '')
                + COALESCE('.' + a3.[key], '')
                + COALESCE('.' + a4.[key], ''),
       AttrValue = COALESCE(a4.value, a3.value, a2.value, a1.value)
FROM ORC.EventsSnapshot_RawData
OUTER APPLY OPENJSON(params) a1
OUTER APPLY (
    SELECT [key], value, type
    FROM OPENJSON(a1.value)
    WHERE ISJSON(a1.value) = 1
) a2
OUTER APPLY (
    SELECT [key], value, type
    FROM OPENJSON(a2.value)
    WHERE ISJSON(a2.value) = 1
) a3
OUTER APPLY (
    SELECT [key], value, type
    FROM OPENJSON(a3.value)
    WHERE ISJSON(a3.value) = 1
) a4
The number of required OUTER APPLY clauses is determined up front by counting the occurrences of "[" in the PARAMS field value. That count is then used to dynamically generate the SQL, which is executed via sp_executesql.
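For illustration, the dynamic generation described above could be sketched as follows. This is a minimal Python sketch, not the stored procedure itself: nesting_depth and build_flatten_sql are hypothetical helper names, and the depth is obtained by parsing the JSON rather than by counting bracket characters, which is an assumption about what that count is meant to approximate.

```python
import json
from typing import Any


def nesting_depth(node: Any) -> int:
    """Return how many OPENJSON levels are needed to reach the deepest leaf."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return 0  # scalars add no further level
    return 1 + max((nesting_depth(child) for child in children), default=0)


def build_flatten_sql(levels: int, table: str = "ORC.EventsSnapshot_RawData") -> str:
    """Build the query from the question with `levels` chained OUTER APPLYs."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    aliases = [f"a{i}" for i in range(1, levels + 1)]
    name_parts = [f"{aliases[0]}.[key]"] + [
        f"COALESCE('.' + {alias}.[key], '')" for alias in aliases[1:]
    ]
    if levels == 1:
        value_expr = f"{aliases[0]}.value"  # COALESCE needs at least two arguments
    else:
        # Deepest alias first, so the innermost available value wins.
        value_expr = "COALESCE(" + ", ".join(
            f"{alias}.value" for alias in reversed(aliases)
        ) + ")"
    lines = [
        "SELECT uuid,",
        "       AttrName = " + " + ".join(name_parts) + ",",
        f"       AttrValue = {value_expr}",
        f"FROM {table}",
        f"OUTER APPLY OPENJSON(params) {aliases[0]}",
    ]
    for parent, child in zip(aliases, aliases[1:]):
        lines.append(
            f"OUTER APPLY (SELECT [key], value, type FROM OPENJSON({parent}.value) "
            f"WHERE ISJSON({parent}.value) = 1) {child}"
        )
    return "\n".join(lines)


params = json.loads('{"owner": {"ownergeography": {"city": "abc"}}}')
sql = build_flatten_sql(nesting_depth(params))
print(sql)
```

The generated text has one OUTER APPLY per nesting level, so the deepest record in a batch determines the shape of the query for every row.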
Unfortunately, this approach is quite inefficient in terms of execution time: for 11 million records it takes about 3.5 hours to finish.
Someone suggested that I use Databricks. OK, so I:
created a notebook with the following Python code, which reads the ORC files from ADLS and materializes them as a Databricks table:
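# Read every ORC file under the mounted ADLS input folder
orcfile = "/mnt/adls/.../Input/*.orc"
eventDf = spark.read.orc(orcfile)
#spark.sql("drop table if exists ORC.Events_RawData")
# Materialize the data as a table, replacing any existing content
eventDf.write.mode("overwrite").saveAsTable("ORC.Events_Raw")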
Can you please suggest the correct way of achieving the goal, i.e. converting the PARAMS attribute into KEY and VALUE attributes in a generic way?
[EDIT] Please find below some sample JSON structures that need to be standardized into the expected structure.
Sample 1
{
  "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
  "eventCreateTime": "2020-05-12T15:38:23.717Z",
  "time": 1589297903717,
  "owner": {
    "ownergeography": {
      "city": "abc",
      "country": "abc"
    },
    "ownername": {
      "firstname": "abc",
      "lastname": "def"
    },
    "clientApiKey": "xxxxx",
    "businessProfileApiKey": null,
    "userId": null
  },
  "campaignType": "Mobile push"
}
Sample 2
{
  "correlationIds": [
    {
      "campaignId": "iXyS4z811Rax",
      "correlationId": "b316233807ac68675f37787f5dd83871"
    }
  ],
  "variantId": 1278915,
  "utmCampaign": "",
  "ua.os.major": "8"
}
Sample 3
{
  "correlationId": "ls7XmuuiThWzktUeewqgWg",
  "eventCreateTime": "2020-05-12T12:40:20.786Z",
  "time": 1589287220786,
  "modifiedBy": {
    "clientId": null,
    "clientApiKey": "xxx",
    "businessProfileApiKey": null,
    "userId": null
  },
  "campaignType": "Mobile push"
}
Well, this is your get-all-and-everything approach :-)
First we declare a table variable and fill it with your samples to simulate your issue (please try to provide this yourself next time).
DECLARE @table TABLE(ID INT IDENTITY, AnyJSON NVARCHAR(MAX));
INSERT INTO @table VALUES
(N' {
"correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
"eventCreateTime": "2020-05-12T15:38:23.717Z",
"time": 1589297903717,
"owner": {
"ownergeography": {
"city": "abc",
"country": "abc"
},
"ownername": {
"firstname": "abc",
"lastname": "def"
},
"clientApiKey": "xxxxx",
"businessProfileApiKey": null,
"userId": null
},
"campaignType": "Mobile push"
}')
,(N'{
"correlationIds": [
{
"campaignId": "iXyS4z811Rax",
"correlationId": "b316233807ac68675f37787f5dd83871"
}
],
"variantId": 1278915,
"utmCampaign": "",
"ua.os.major": "8"
}')
,(N'{
"correlationId": "ls7XmuuiThWzktUeewqgWg",
"eventCreateTime": "2020-05-12T12:40:20.786Z",
"time": 1589287220786,
"modifiedBy": {
"clientId": null,
"clientApiKey": "xxx",
"businessProfileApiKey": null,
"userId": null
},
"campaignType": "Mobile push"
}');
--The query
WITH recCTE AS
(
    SELECT ID
          ,CAST(1 AS BIGINT) AS ObjectIndex
          ,CAST(N'000' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) SortString
          ,1 AS NestLevel
          ,CAST(CONCAT(N'Root-',ID,'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonPath
          ,CAST(N'$' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonKey
          ,CAST(AnyJSON COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonValue
          ,CAST(CASE WHEN ISJSON(AnyJSON)=1 THEN AnyJSON COLLATE DATABASE_DEFAULT ELSE NULL END AS NVARCHAR(MAX)) AS NestedJSON
    FROM @table t
    UNION ALL
    SELECT r.ID
          ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
          ,CAST(CONCAT(r.SortString,STR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),3)) AS NVARCHAR(MAX))
          ,r.NestLevel+1
          ,CAST(CONCAT(r.JsonPath, A.[key] + N'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,CAST(A.[key] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,r.JsonValue COLLATE DATABASE_DEFAULT
          ,CAST(A.[value] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
    FROM recCTE r
    CROSS APPLY OPENJSON(r.NestedJSON) A
    WHERE ISJSON(r.NestedJSON)=1
)
SELECT ID
      ,JsonPath
      ,JsonKey
      ,NestedJSON AS JsonValue
FROM recCTE
WHERE ISJSON(NestedJSON)=0
ORDER BY recCTE.ID,SortString;
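The result
+---+----------------------------------------+-----------------+----------------------------------+
| ID | JsonPath | JsonKey | JsonValue |
+---+----------------------------------------+-----------------+----------------------------------+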
| 1 | Root-1.correlationId. | correlationId | c3xOeEEQQCCA9sEx7-u6FA |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.eventCreateTime. | eventCreateTime | 2020-05-12T15:38:23.717Z |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.time. | time | 1589297903717 |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.city. | city | abc |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.country. | country | abc |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.firstname. | firstname | abc |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.lastname. | lastname | def |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.clientApiKey. | clientApiKey | xxxxx |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.campaignType. | campaignType | Mobile push |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.campaignId. | campaignId | iXyS4z811Rax |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.correlationId. | correlationId | b316233807ac68675f37787f5dd83871 |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.variantId. | variantId | 1278915 |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.utmCampaign. | utmCampaign | |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.ua.os.major. | ua.os.major | 8 |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.correlationId. | correlationId | ls7XmuuiThWzktUeewqgWg |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.eventCreateTime. | eventCreateTime | 2020-05-12T12:40:20.786Z |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.time. | time | 1589287220786 |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.modifiedBy.clientApiKey. | clientApiKey | xxx |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.campaignType. | campaignType | Mobile push |
+---+----------------------------------------+-----------------+----------------------------------+
The idea in short:
A recursive CTE walks down the structure.
The query tests every fragment (the [value] coming from OPENJSON) for being valid JSON.
If a fragment is valid JSON, it is opened again on the next level; otherwise it is returned as a leaf value.
The SortString column is needed to get a final sort order.
Come back if you have any open questions.
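Since the question also mentions a Python notebook, the same recursive walk can be sketched in plain Python. This is only an illustration under stated assumptions, not a port of the query: flatten_json is a hypothetical helper name, it works on already parsed JSON, it omits the Root-N prefix and trailing dot of JsonPath, and it skips null values because they do not appear in the result above.

```python
import json
from typing import Any, Iterator, Tuple


def flatten_json(node: Any, path: str = "") -> Iterator[Tuple[str, str, Any]]:
    """Yield (path, key, value) for every non-null scalar leaf below node."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        # Array elements are addressed by their position, as OPENJSON does.
        children = ((str(index), item) for index, item in enumerate(node))
    else:
        return  # a bare scalar has no key, so there is nothing to emit
    for key, value in children:
        child_path = f"{path}.{key}" if path else key
        if isinstance(value, (dict, list)):
            # Nested object or array: descend one more level.
            yield from flatten_json(value, child_path)
        elif value is not None:
            # Scalar leaf: emit it as one KEY/VALUE row.
            yield child_path, key, value


sample = json.loads("""
{
  "correlationIds": [
    {"campaignId": "iXyS4z811Rax",
     "correlationId": "b316233807ac68675f37787f5dd83871"}
  ],
  "variantId": 1278915,
  "utmCampaign": "",
  "ua.os.major": "8"
}
""")

rows = list(flatten_json(sample))
for row in rows:
    print(row)
```

As in the SQL result, a key that itself contains dots (ua.os.major) makes the dotted path ambiguous, so a different separator may be preferable if the path has to be parsed later.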