
U-SQL: How to process an Avro file with multiple JSON arrays with multiple objects?

I receive an Avro file in my Data Lake Store through Stream Analytics and an Event Hub using Capture.

The structure of the file looks like this:

[{"id":1,"pid":"abc","value":"1","utctimestamp":1537805867},{"id":6569,"pid":"1E014000","value":"-5.8","utctimestamp":1537805867}]
[{"id":2,"pid":"cde","value":"77","utctimestamp":1537772095},{"id":6658,"pid":"02002001","value":"77","utctimestamp":1537772095}]

Sample File

I've used this script:

@rs =
EXTRACT
    SequenceNumber      long,
    Offset              string,
    EnqueuedTimeUtc     string,
    Body                byte[]
FROM @input_file
USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"
{
""type"": ""record"",
""name"": ""EventData"",
""namespace"": ""Microsoft.ServiceBus.Messaging"",
""fields"": [
    {
        ""name"": ""SequenceNumber"",
        ""type"": ""long""
    },
    {
        ""name"": ""Offset"",
        ""type"": ""string""
    },
    {
        ""name"": ""EnqueuedTimeUtc"",
        ""type"": ""string""
    },
    {
        ""name"": ""SystemProperties"",
        ""type"": {
            ""type"": ""map"",
            ""values"": [
                ""long"",
                ""double"",
                ""string"",
                ""bytes""
            ]
        }
    },
    {
        ""name"": ""Properties"",
        ""type"": {
            ""type"": ""map"",
            ""values"": [
                ""long"",
                ""double"",
                ""string"",
                ""bytes"",
                ""null""
            ]
        }
    },
    {
        ""name"": ""Body"",
        ""type"": [
            ""null"",
            ""bytes""
        ]
    }
]
}
");

@jsonify = SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(Encoding.UTF8.GetString(Body)) AS message FROM @rs;

@cnt =  SELECT  message["id"] AS id,
            message["id2"] AS pid,
            message["value"] AS value,
            message["utctimestamp"] AS utctimestamp,
            message["extra"] AS extra
    FROM @jsonify;

OUTPUT @cnt TO @output_file USING Outputters.Text(quoting: false);

The script produces a file, but it contains only the delimiting commas and no values.

How do I extract and transform this structure so that I can output it as a flattened 4-column CSV file?

I got this to work by exploding the JSON column and applying the JsonTuple function a second time (though I suspect it could be simplified):

@jsonify =
    SELECT JsonFunctions.JsonTuple(Encoding.UTF8.GetString(Body)) AS message
    FROM @rs;

// Explode the tuple as key-value pair;
@working =
    SELECT key,
           JsonFunctions.JsonTuple(value) AS value
    FROM @jsonify
         CROSS APPLY
             EXPLODE(message) AS y(key, value);
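
To see why the second pass is needed, here is a rough Python model of the two steps. It is only a sketch, not the sample library's code: `json_tuple` is a hypothetical stand-in for `JsonFunctions.JsonTuple`, and it assumes that a top-level array is returned as a map keyed by element position (`[0]`, `[1]`, ...) with each element kept as JSON text. Under that assumption, looking up `"id"` on the first map finds nothing, which would explain the empty columns in the original output.

```python
import json


def json_tuple(text):
    """Hypothetical stand-in for JsonTuple: map each top-level child to a string.

    Assumption: children of an array are keyed by position ("[0]", "[1]", ...)
    and nested objects are kept as JSON text rather than being flattened.
    """
    root = json.loads(text) if text else {}
    if isinstance(root, list):
        items = ((f"[{i}]", child) for i, child in enumerate(root))
    else:
        items = root.items()
    return {
        key: child if isinstance(child, str) else json.dumps(child)
        for key, child in items
    }


# One event body, taken from the file structure shown in the question.
body = (
    '[{"id":1,"pid":"abc","value":"1","utctimestamp":1537805867},'
    '{"id":6569,"pid":"1E014000","value":"-5.8","utctimestamp":1537805867}]'
)

# First pass: the keys are array positions, so a lookup by field name finds nothing.
message = json_tuple(body)
first_attempt = message.get("id")

# Second pass: explode the map into (key, value) pairs and parse each element again.
rows = []
for key, value in message.items():
    record = json_tuple(value)
    rows.append(
        (record["id"], record["pid"], record["value"], record["utctimestamp"])
    )

for row in rows:
    print(",".join(row))
```

Each tuple in `rows` corresponds to one line of the flattened output; all values are strings, as they would be coming out of a string-to-string map.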

Full script:

REFERENCE ASSEMBLY Avro;
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; 

USING Microsoft.Analytics.Samples.Formats.Json;

DECLARE @input_file string = @"\input\input21.avro";
DECLARE @output_file string = @"\output\output.csv";


@rs =
EXTRACT
 Body byte[]
FROM @input_file
USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"{
    ""type"": ""record"",
    ""name"": ""EventData"",
    ""namespace"": ""Microsoft.ServiceBus.Messaging"",
    ""fields"": [
        {
            ""name"": ""SequenceNumber"",
            ""type"": ""long""
        },
        {
            ""name"": ""Offset"",
            ""type"": ""string""
        },
        {
            ""name"": ""EnqueuedTimeUtc"",
            ""type"": ""string""
        },
        {
            ""name"": ""SystemProperties"",
            ""type"": {
                ""type"": ""map"",
                ""values"": [
                    ""long"",
                    ""double"",
                    ""string"",
                    ""bytes""
                ]
            }
        },
        {
            ""name"": ""Properties"",
            ""type"": {
                ""type"": ""map"",
                ""values"": [
                    ""long"",
                    ""double"",
                    ""string"",
                    ""bytes"",
                    ""null""
                ]
            }
        },
        {
            ""name"": ""Body"",
            ""type"": [
                ""null"",
                ""bytes""
            ]
        }
    ]
}");


@jsonify =
    SELECT JsonFunctions.JsonTuple(Encoding.UTF8.GetString(Body)) AS message
    FROM @rs;

// Explode the tuple as key-value pair;
@working =
    SELECT key,
           JsonFunctions.JsonTuple(value) AS value
    FROM @jsonify
         CROSS APPLY
             EXPLODE(message) AS y(key, value);


// Select the four fields present in the sample structure shown in the question.
@cnt =
    SELECT value["id"] AS id,
           value["pid"] AS pid,
           value["value"] AS value,
           value["utctimestamp"] AS utctimestamp
    FROM @working;


OUTPUT @cnt TO @output_file USING Outputters.Text(quoting: false);

My results:

[Image: results]
