

How to convert an unstructured JSON file from Azure Cosmos DB to a structured table?

I have a JSON file with a dynamic schema in Azure Cosmos DB (Mongo API). I want to read this file, convert it into a structured SQL table, and store it in Azure SQL Data Warehouse. How do I achieve this?

I have already tried reading this unstructured data with an Azure Data Factory Copy Activity, but it seems ADF cannot read unstructured data.

Sample data from my Cosmos DB:

{
    "name" : "Dren",
    "details" : [
        {
            "name" : "Vinod",
            "relation" : "Father",
            "age" : 40,
            "country" : "India",
            "ph1" : "+91-9492918762",
            "ph2" : "+91-8769187451"
        },
        {
            "name" : "Den",
            "relation" : "Brother",
            "age" : 10,
            "country" : "India"
        },
        {
            "name" : "Vinita",
            "relation" : "Mother",
            "age" : 40,
            "country" : "India",
            "ph1" : "+91-9103842782"
        }
    ]
}

I expect NULL values for columns whose values do not exist in the JSON file.
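
For the sample above, the structured table I have in mind would look something like this (assuming one row per entry in the details array; the top-level name, "Dren", could become an extra key column):

name     relation   age   country   ph1              ph2
Vinod    Father     40    India     +91-9492918762   +91-8769187451
Den      Brother    10    India     NULL             NULL
Vinita   Mother     40    India     +91-9103842782   NULL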

As you have noticed, Data Factory doesn't manipulate unstructured data. Relequestual has correctly suggested that an outside data mapper will be required, as Azure SQL Data Warehouse does not offer JSON manipulation either. There are a couple of ways to do this from Data Factory. Both involve calling another service to handle the mapping for you.

1) Have the pipeline call an Azure Function to do the work. The pipeline wouldn't be able to pass data in and out of the function; the function would need to read from Cosmos and write to Azure DW on its own. In between, you can do your mapping in whatever language you write the function in. The upside is that functions are fairly simple to write, but your ability to scale will be somewhat limited by how much data the function can process within a few minutes. A sketch of the mapping step is shown below.
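
As an illustration only, here is a minimal Python sketch of that read-flatten-write step, assuming the pymongo and pyodbc packages; the connection strings, the mydb/people database and collection names, and the dim_relatives target table are all hypothetical placeholders:

import pymongo
import pyodbc

# Hypothetical connection strings -- substitute your own.
COSMOS_CONN = "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true"
SQLDW_CONN = ("Driver={ODBC Driver 17 for SQL Server};"
              "Server=<server>.database.windows.net;Database=<db>;Uid=<user>;Pwd=<pwd>")

# Columns the target table expects; keys missing from the JSON become None (NULL).
COLUMNS = ["name", "relation", "age", "country", "ph1", "ph2"]

def flatten(doc):
    # One row per entry in the 'details' array, padded with None for absent keys.
    for detail in doc.get("details", []):
        yield [detail.get(col) for col in COLUMNS]

def run():
    cosmos = pymongo.MongoClient(COSMOS_CONN)
    docs = cosmos["mydb"]["people"].find()   # hypothetical database/collection

    with pyodbc.connect(SQLDW_CONN) as conn:
        cursor = conn.cursor()
        insert_sql = ("INSERT INTO dim_relatives (name, relation, age, country, ph1, ph2) "
                      "VALUES (?, ?, ?, ?, ?, ?)")
        for doc in docs:
            for row in flatten(doc):
                cursor.execute(insert_sql, row)   # pyodbc sends None as NULL
        conn.commit()

In practice you would batch the inserts (for example with cursor.executemany, or by staging to blob storage and loading via PolyBase), since row-by-row inserts are slow against Azure DW.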

2) Do an interim hop in and out of Azure Data Lake. You would copy the data into a storage account (there are a few options that work with Data Lake Analytics), call a U-SQL job to do the mapping, and then load the results into Azure DW. The downside is that you are adding extra reads/writes to the storage account. However, it does let you scale as much as you need to based on your data volume, and U-SQL is a SQL-like language if that is your preference.
