Convert any arbitrarily nested JSON structure into KEY and VALUE fields

I was asked to build an ETL pipeline in Azure. This pipeline should:

  1. read the ORC file submitted by the vendor to ADLS
  2. parse the PARAMS field of the ORC structure, which stores a JSON document, and add it to the output as two new fields (KEY, VALUE)
  3. write the output to the Azure SQL database

The problem is that different types of records use different JSON structures. I do not want to write a custom expression for each class of JSON structure (there would be hundreds of them). Rather, I'm looking for a generic mechanism that can parse them regardless of the type of the input JSON structure.

So far, to fulfill this requirement, I have been using the ADF built-in connector for ORC. The process in its current design:

  1. Use a copy activity that reads the ORC file and moves the data to the Azure SQL database
  2. Use the following T-SQL statement, as part of a stored procedure executed after step 1, to parse the content of the PARAMS field

    SELECT uuid,
           AttrName = a1.[key] + COALESCE('.' + a2.[key], '') + COALESCE('.' + a3.[key], '') + COALESCE('.' + a4.[key], ''),
           AttrValue = COALESCE(a4.value, a3.value, a2.value, a1.value)
    FROM ORC.EventsSnapshot_RawData
    OUTER APPLY OPENJSON(params) a1
    OUTER APPLY (
        SELECT [key], value, type FROM OPENJSON(a1.value) WHERE ISJSON(a1.value) = 1
    ) a2
    OUTER APPLY (
        SELECT [key], value, type FROM OPENJSON(a2.value) WHERE ISJSON(a2.value) = 1
    ) a3
    OUTER APPLY (
        SELECT [key], value, type FROM OPENJSON(a3.value) WHERE ISJSON(a3.value) = 1
    ) a4

The number of required OUTER APPLY statements is determined up front by counting the occurrences of "[" in the PARAMS field value; it is then used to dynamically generate the SQL, which is executed via sp_executesql.
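To make the dynamic generation step concrete, here is a minimal Python sketch that unrolls the statement above for a given number of nesting levels. The function name `build_flatten_sql` and its `levels` parameter are hypothetical; the actual generator inside the stored procedure is not shown in this post, so treat this only as an illustration of the pattern.

```python
def build_flatten_sql(levels: int) -> str:
    """Return the OPENJSON query, unrolled for `levels` nesting levels.

    Hypothetical helper: table and column names are copied from the
    statement in the question, everything else is an assumption.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")

    aliases = [f"a{i}" for i in range(1, levels + 1)]

    # AttrName: a1.[key] + COALESCE('.' + a2.[key], '') + ...
    name_expr = "a1.[key]" + "".join(
        f" + COALESCE('.' + {alias}.[key], '')" for alias in aliases[1:]
    )

    # AttrValue: the deepest non-NULL value wins. COALESCE needs at least
    # two arguments in T-SQL, so a single level uses the column directly.
    if levels == 1:
        value_expr = "a1.value"
    else:
        value_expr = (
            "COALESCE(" + ", ".join(f"{a}.value" for a in reversed(aliases)) + ")"
        )

    lines = [
        f"SELECT uuid, AttrName = {name_expr}, AttrValue = {value_expr}",
        "FROM ORC.EventsSnapshot_RawData",
        "OUTER APPLY OPENJSON(params) a1",
    ]
    # One extra OUTER APPLY per additional level, each reading its parent.
    for parent, child in zip(aliases, aliases[1:]):
        lines.append(
            f"OUTER APPLY (SELECT [key], value, type FROM OPENJSON({parent}.value) "
            f"WHERE ISJSON({parent}.value) = 1) {child}"
        )
    return "\n".join(lines)


sql = build_flatten_sql(4)
print(sql)
```

With `levels=4` the generated text has the same shape as the hand-written statement: four OUTER APPLY clauses, with every additional level adding one more.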

Unfortunately, this approach is quite inefficient in terms of execution time: for 11 million records it takes about 3.5 hours to finish.

Someone suggested that I use Databricks. OK, so I:

  1. created a notebook with the following Python code to read the ORC files from ADLS and materialize them as a Databricks table

     orcfile = "/mnt/adls/.../Input/*.orc"
     eventDf = spark.read.orc(orcfile)
     #spark.sql("drop table if exists ORC.Events_RawData")
     eventDf.write.mode("overwrite").saveAsTable("ORC.Events_Raw")

  2. am now trying to find code that gives the same result I get from the T-SQL OPENJSON calls. I started with Python code that uses recursion to parse the PARAMS attribute; however, it is even slower than the T-SQL version.

Can you please suggest the correct way of achieving this goal, i.e. converting the PARAMS attribute to KEY and VALUE attributes in a generic way?

[EDIT] Please find below sample JSON structures that need to be standardized into the expected structure.

Sample 1

{
    "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
    "eventCreateTime": "2020-05-12T15:38:23.717Z",
    "time": 1589297903717,
    "owner": {
        "ownergeography": {
            "city": "abc",
            "country": "abc"
        },
        "ownername": {
            "firstname": "abc",
            "lastname": "def"
        },
        "clientApiKey": "xxxxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}

Sample 2

{
    "correlationIds": [
        {
            "campaignId": "iXyS4z811Rax",
            "correlationId": "b316233807ac68675f37787f5dd83871"
        }
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
}

Sample 3

{
    "correlationId": "ls7XmuuiThWzktUeewqgWg",
    "eventCreateTime": "2020-05-12T12:40:20.786Z",
    "time": 1589287220786,
    "modifiedBy": {
        "clientId": null,
        "clientApiKey": "xxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}

Sample expected output (Spark DataFrame): [image]

Well, this is your get-all-and-everything approach :-)

First we declare a table variable and fill it with your samples to simulate your issue (please try to provide this yourself next time).

DECLARE @table TABLE(ID INT IDENTITY, AnyJSON NVARCHAR(MAX));
INSERT INTO @table VALUES
(N' {
    "correlationId": "c3xOeEEQQCCA9sEx7-u6FA",
    "eventCreateTime": "2020-05-12T15:38:23.717Z",
    "time": 1589297903717,
    "owner": {
        "ownergeography": {
            "city": "abc",
            "country": "abc"
        },
        "ownername": {
            "firstname": "abc",
            "lastname": "def"
        },
        "clientApiKey": "xxxxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}')
,(N'{
    "correlationIds": [
        {
            "campaignId": "iXyS4z811Rax",
            "correlationId": "b316233807ac68675f37787f5dd83871"
        }
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
    }')
,(N'{
    "correlationId": "ls7XmuuiThWzktUeewqgWg",
    "eventCreateTime": "2020-05-12T12:40:20.786Z",
    "time": 1589287220786,
    "modifiedBy": {
        "clientId": null,
        "clientApiKey": "xxx",
        "businessProfileApiKey": null,
        "userId": null
    },
    "campaignType": "Mobile push"
}');

--The query

WITH recCTE AS
(
    SELECT ID
          ,CAST(1 AS BIGINT) AS ObjectIndex
          ,CAST(N'000' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) SortString
          ,1 AS NestLevel
          ,CAST(CONCAT(N'Root-',ID,'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonPath
          ,CAST(N'$' COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonKey
          ,CAST(AnyJSON COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX)) AS JsonValue 
          ,CAST(CASE WHEN ISJSON(AnyJSON)=1 THEN AnyJSON COLLATE DATABASE_DEFAULT ELSE NULL END AS NVARCHAR(MAX)) AS NestedJSON 
    FROM @table t

    UNION ALL

    SELECT r.ID
          ,ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
          ,CAST(CONCAT(r.SortString,STR(ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),3)) AS NVARCHAR(MAX))
          ,r.NestLevel+1
          ,CAST(CONCAT(r.JsonPath, A.[key] + N'.') COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,CAST(A.[key] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
          ,r.JsonValue  COLLATE DATABASE_DEFAULT
          ,CAST(A.[value] COLLATE DATABASE_DEFAULT AS NVARCHAR(MAX))
    FROM recCTE r
    CROSS APPLY OPENJSON(r.NestedJSON) A
    WHERE ISJSON(r.NestedJSON)=1
)
SELECT ID
      ,JsonPath
      ,JsonKey
      ,NestedJSON AS JsonValue
FROM recCTE 
WHERE ISJSON(NestedJSON)=0
ORDER BY recCTE.ID,SortString;

The result

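+---+----------------------------------------+-----------------+----------------------------------+
|ID | JsonPath                               | JsonKey         | JsonValue                        |
+---+----------------------------------------+-----------------+----------------------------------+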
| 1 | Root-1.correlationId.                  | correlationId   | c3xOeEEQQCCA9sEx7-u6FA           |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.eventCreateTime.                | eventCreateTime | 2020-05-12T15:38:23.717Z         |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.time.                           | time            | 1589297903717                    |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.city.      | city            | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownergeography.country.   | country         | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.firstname.      | firstname       | abc                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.ownername.lastname.       | lastname        | def                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.owner.clientApiKey.             | clientApiKey    | xxxxx                            |
+---+----------------------------------------+-----------------+----------------------------------+
| 1 | Root-1.campaignType.                   | campaignType    | Mobile push                      |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.campaignId.    | campaignId      | iXyS4z811Rax                     |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.correlationIds.0.correlationId. | correlationId   | b316233807ac68675f37787f5dd83871 |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.variantId.                      | variantId       | 1278915                          |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.utmCampaign.                    | utmCampaign     |                                  |
+---+----------------------------------------+-----------------+----------------------------------+
| 2 | Root-2.ua.os.major.                    | ua.os.major     | 8                                |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.correlationId.                  | correlationId   | ls7XmuuiThWzktUeewqgWg           |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.eventCreateTime.                | eventCreateTime | 2020-05-12T12:40:20.786Z         |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.time.                           | time            | 1589287220786                    |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.modifiedBy.clientApiKey.        | clientApiKey    | xxx                              |
+---+----------------------------------------+-----------------+----------------------------------+
| 3 | Root-3.campaignType.                   | campaignType    | Mobile push                      |
+---+----------------------------------------+-----------------+----------------------------------+

The idea in short:

  • We use a recursive CTE to walk down the structure.
  • The query tests every fragment ([value] coming from OPENJSON) for being valid JSON.
  • If the fragment is valid, the recursion walks one level deeper.
  • The column SortString is needed to get the final sort order.
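For readers who want to try the same walk outside SQL Server, here is a minimal Python sketch of the idea. It is not the query above translated line by line: the names `flatten_json` and `_scalar_to_text` are hypothetical, it relies only on the standard `json` module, and it makes no claim about performance at the scale mentioned in the question. Like the query, it descends into objects and arrays, keys array elements by their zero-based index, and skips null values.

```python
import json
from typing import Any, Iterator, Tuple


def _scalar_to_text(value: Any) -> str:
    """Render a JSON scalar as text (booleans in JSON spelling)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def flatten_json(value: Any, path: str = "Root.") -> Iterator[Tuple[str, str, str]]:
    """Yield (path, key, value) for every non-null leaf below `value`."""
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        # Array elements are keyed by their zero-based index.
        children = ((str(i), item) for i, item in enumerate(value))
    else:
        return  # a bare scalar at the top level has no key to report

    for key, child in children:
        child_path = f"{path}{key}."
        if isinstance(child, (dict, list)):
            # Nested fragment: walk one level deeper.
            yield from flatten_json(child, child_path)
        elif child is not None:
            yield child_path, key, _scalar_to_text(child)


sample = """{
    "correlationIds": [
        {"campaignId": "iXyS4z811Rax",
         "correlationId": "b316233807ac68675f37787f5dd83871"}
    ],
    "variantId": 1278915,
    "utmCampaign": "",
    "ua.os.major": "8"
}"""

rows = list(flatten_json(json.loads(sample), "Root-2."))
for row in rows:
    print(row)
```

One difference worth knowing: the query re-tests every string value with ISJSON, so a string that itself contains JSON text would be walked into, whereas this sketch treats every string as a leaf. Large or fractional numbers may also be rendered differently by Python than by OPENJSON.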

Come back if you have any open questions.
