How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

I'm trying to back up my Cosmos DB storage using Azure Data Factory (v2). In general, it's doing its job, but I want each doc in the Cosmos collection to correspond to a new json file in blob storage.

With the following copy parameters, I'm able to copy all docs in a collection into a single file in Azure Blob storage:

{
"name": "ForEach_mih",
"type": "ForEach",
"typeProperties": {
    "items": {
        "value": "@pipeline().parameters.cw_items",
        "type": "Expression"
    },
    "activities": [
        {
            "name": "Copy_mih",
            "type": "Copy",
            "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false
            },
            "userProperties": [
                {
                    "name": "Source",
                    "value": "@{item().source.collectionName}"
                },
                {
                    "name": "Destination",
                    "value": "cosmos-backup-v2/@{item().destination.fileName}"
                }
            ],
            "typeProperties": {
                "source": {
                    "type": "DocumentDbCollectionSource",
                    "nestingSeparator": "."
                },
                "sink": {
                    "type": "BlobSink"
                },
                "enableStaging": false,
                "enableSkipIncompatibleRow": true,
                "redirectIncompatibleRowSettings": {
                    "linkedServiceName": {
                        "referenceName": "Clear_Test_BlobStorage",
                        "type": "LinkedServiceReference"
                    },
                    "path": "cosmos-backup-logs"
                },
                "cloudDataMovementUnits": 0
            },
            "inputs": [
                {
                    "referenceName": "SourceDataset_mih",
                    "type": "DatasetReference",
                    "parameters": {
                        "cw_collectionName": "@item().source.collectionName"
                    }
                }
            ],
            "outputs": [
                {
                    "referenceName": "DestinationDataset_mih",
                    "type": "DatasetReference",
                    "parameters": {
                        "cw_fileName": "@item().destination.fileName"
                    }
                }
            ]
        }
    ]
}
}

How can I copy each Cosmos doc to a separate file and name it {PartitionId}-{docId}?

UPD

Source dataset code:

{
"name": "ClustersData",
"properties": {
    "linkedServiceName": {
        "referenceName": "Clear_Test_CosmosDb",
        "type": "LinkedServiceReference"
    },
    "type": "DocumentDbCollection",
    "typeProperties": {
        "collectionName": "directory-clusters"
    }
}
}

Destination dataset code:

{
"name": "OutputClusters",
"properties": {
    "linkedServiceName": {
        "referenceName": "Clear_Test_BlobStorage",
        "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "typeProperties": {
        "format": {
            "type": "JsonFormat",
            "filePattern": "arrayOfObjects"
        },
        "fileName": "",
        "folderPath": "cosmos-backup-logs"
    }
}
}

Pipeline code:

{
"name": "copy-clsts",
"properties": {
    "activities": [
        {
            "name": "LookupClst",
            "type": "Lookup",
            "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false
            },
            "typeProperties": {
                "source": {
                    "type": "DocumentDbCollectionSource",
                    "nestingSeparator": "."
                },
                "dataset": {
                    "referenceName": "ClustersData",
                    "type": "DatasetReference"
                },
                "firstRowOnly": false
            }
        },
        {
            "name": "ForEachClst",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "LookupClst",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "typeProperties": {
                "items": {
                    "value": "@activity('LookupClst').output.value",
                    "type": "Expression"
                },
                "batchCount": 8,
                "activities": [
                    {
                        "name": "CpyClst",
                        "type": "Copy",
                        "policy": {
                            "timeout": "7.00:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false
                        },
                        "typeProperties": {
                            "source": {
                                "type": "DocumentDbCollectionSource",
                                "query": "select @{item()}",
                                "nestingSeparator": "."
                            },
                            "sink": {
                                "type": "BlobSink"
                            },
                            "enableStaging": false,
                            "enableSkipIncompatibleRow": true,
                            "cloudDataMovementUnits": 0
                        },
                        "inputs": [
                            {
                                "referenceName": "ClustersData",
                                "type": "DatasetReference"
                            }
                        ],
                        "outputs": [
                            {
                                "referenceName": "OutputClusters",
                                "type": "DatasetReference"
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
}

Example of a doc in the input collection (all docs have the same format):

{
   "$type": "Entities.ADCluster",
    "DisplayName": "TESTNetBIOS",
    "OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
    "ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
    "AllowLdapLifeCycleSynchronization": true,
    "DirectoryServers": [
        {
            "$type": "Entities.DirectoryServer",
            "AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
            "Port": 389,
            "Host": "192.168.342.234"
        }
    ],
    "DomainNames": [
        "TESTNetBIOS"
    ],
    "BaseDn": null,
    "UseSsl": false,
    "RepositoryType": 1,
    "DirectoryCustomizations": null,
    "_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
    "LastUpdateTime": "2018-04-05T15:00:40.243Z",
    "id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
    "PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
    "_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
    "_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
    "_attachments": "attachments/",
    "_ts": 1522940440
}

Since your Cosmos DB documents contain arrays, and ADF doesn't support serializing arrays for Cosmos DB, this is the workaround I can provide.

First, export all your documents to json files with export json as-is (to Blob, ADLS, or a file system, any file storage). I think you already know how to do that. This way, each collection will have one json file.

Second, process each json file, extracting each row of the file into its own single file.

I'm only providing the pipeline for step 2. You could use an Execute Pipeline activity to chain step 1 and step 2 (a sketch of that chaining follows below), and you could even handle all the collections in step 2 with a ForEach activity.
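For illustration, here is a minimal sketch of such a master pipeline, assuming hypothetical child pipelines named Step1_ExportAsIs and Step2_SplitPerDocument (neither is defined in this answer); the two Execute Pipeline activities are linked by a Succeeded dependency so step 2 only runs after step 1 completes:

{
"name": "MasterBackup",
"properties": {
    "activities": [
        {
            "name": "RunStep1",
            "type": "ExecutePipeline",
            "typeProperties": {
                "pipeline": {
                    "referenceName": "Step1_ExportAsIs",
                    "type": "PipelineReference"
                },
                "waitOnCompletion": true
            }
        },
        {
            "name": "RunStep2",
            "type": "ExecutePipeline",
            "dependsOn": [
                {
                    "activity": "RunStep1",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "typeProperties": {
                "pipeline": {
                    "referenceName": "Step2_SplitPerDocument",
                    "type": "PipelineReference"
                },
                "waitOnCompletion": true
            }
        }
    ]
}
}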

Pipeline json

{
"name": "pipeline27",
"properties": {
    "activities": [
        {
            "name": "Lookup1",
            "type": "Lookup",
            "policy": {
                "timeout": "7.00:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false
            },
            "typeProperties": {
                "source": {
                    "type": "BlobSource",
                    "recursive": true
                },
                "dataset": {
                    "referenceName": "AzureBlob7",
                    "type": "DatasetReference"
                },
                "firstRowOnly": false
            }
        },
        {
            "name": "ForEach1",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "Lookup1",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "typeProperties": {
                "items": {
                    "value": "@activity('Lookup1').output.value",
                    "type": "Expression"
                },
                "activities": [
                    {
                        "name": "Copy1",
                        "type": "Copy",
                        "policy": {
                            "timeout": "7.00:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false
                        },
                        "typeProperties": {
                            "source": {
                                "type": "DocumentDbCollectionSource",
                                "query": {
                                    "value": "select @{item()}",
                                    "type": "Expression"
                                },
                                "nestingSeparator": "."
                            },
                            "sink": {
                                "type": "BlobSink"
                            },
                            "enableStaging": false,
                            "cloudDataMovementUnits": 0
                        },
                        "inputs": [
                            {
                                "referenceName": "DocumentDbCollection1",
                                "type": "DatasetReference"
                            }
                        ],
                        "outputs": [
                            {
                                "referenceName": "AzureBlob6",
                                "type": "DatasetReference",
                                "parameters": {
                                    "id": {
                                        "value": "@item().id",
                                        "type": "Expression"
                                    },
                                    "PartitionKey": {
                                        "value": "@item().PartitionKey",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ]
                    }
                ]
            }
        }
    ]
},
"type": "Microsoft.DataFactory/factories/pipelines"

} }

dataset json for lookup

{
"name": "AzureBlob7",
"properties": {
    "linkedServiceName": {
        "referenceName": "bloblinkedservice",
        "type": "LinkedServiceReference"
    },
    "type": "AzureBlob",
    "typeProperties": {
        "format": {
            "type": "JsonFormat",
            "filePattern": "arrayOfObjects"
        },
        "fileName": "cosmos.json",
        "folderPath": "aaa"
    }
},
"type": "Microsoft.DataFactory/factories/datasets"

} }

Source dataset for the copy. Actually, this dataset is not really used; it only exists to host the query (select @{item()}).

{
"name": "DocumentDbCollection1",
"properties": {
    "linkedServiceName": {
        "referenceName": "CosmosDB-r8c",
        "type": "LinkedServiceReference"
    },
    "type": "DocumentDbCollection",
    "typeProperties": {
        "collectionName": "test"
    }
},
"type": "Microsoft.DataFactory/factories/datasets"

} }
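To make the query trick explicit: ADF string-interpolates @{item()}, serializing the current lookup row (the whole document) into the query text, so each Copy iteration asks Cosmos DB for a constant-object projection of exactly one document. As a shortened, purely illustrative expansion (abbreviated values, not real output), the source for one iteration might resolve to:

{
    "source": {
        "type": "DocumentDbCollectionSource",
        "query": "select {\"id\":\"ab2a242d\",\"DisplayName\":\"TESTNetBIOS\"}",
        "nestingSeparator": "."
    }
}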

Destination dataset. With its two parameters, it also addresses your file name request.

{
"name": "AzureBlob6",
"properties": {
    "linkedServiceName": {
        "referenceName": "AzureStorage-eastus",
        "type": "LinkedServiceReference"
    },
    "parameters": {
        "id": {
            "type": "String"
        },
        "PartitionKey": {
            "type": "String"
        }
    },
    "type": "AzureBlob",
    "typeProperties": {
        "format": {
            "type": "JsonFormat",
            "filePattern": "setOfObjects"
        },
        "fileName": {
            "value": "@{dataset().PartitionKey}-@{dataset().id}.json",
            "type": "Expression"
        },
        "folderPath": "aaacosmos"
    }
},
"type": "Microsoft.DataFactory/factories/datasets"

} }

Please also note the limitations of the Lookup activity: only certain data sources are supported for lookup, the maximum number of rows it can return is 5000 (up to 2 MB in size), and currently the max duration for a Lookup activity before timeout is one hour.

Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.

You could consider having an Azure Function that is triggered when documents are added or updated in your collection, and have the Azure Function output the document to blob storage. This should scale well and would be relatively easy to implement (a minimal trigger sketch follows below).
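As a rough sketch of that approach (the database name, collection name, and connection setting name below are placeholders), the function.json for a Cosmos DB change-feed trigger might look like the following; the function body would then write each changed document to a blob named from its partition key and id:

{
    "bindings": [
        {
            "type": "cosmosDBTrigger",
            "direction": "in",
            "name": "documents",
            "connectionStringSetting": "CosmosDbConnection",
            "databaseName": "mydatabase",
            "collectionName": "directory-clusters",
            "leaseCollectionName": "leases",
            "createLeaseCollectionIfNotExists": true
        }
    ]
}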

Just take one collection as an example. And inside the ForEach, your Lookup and your Copy activity's source dataset reference the same Cosmos DB dataset.

If you want to copy your 5 collections, you could put this pipeline inside an Execute Pipeline activity, and have the master pipeline wrap that Execute Pipeline activity in a ForEach (see the sketch below).
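A minimal sketch of such a master pipeline, assuming the per-collection pipeline is pipeline27 and that it has been given a collectionName parameter (the version posted above doesn't have one yet):

{
"name": "MasterCopyAllCollections",
"properties": {
    "activities": [
        {
            "name": "ForEachCollection",
            "type": "ForEach",
            "typeProperties": {
                "items": {
                    "value": "@pipeline().parameters.collectionNames",
                    "type": "Expression"
                },
                "activities": [
                    {
                        "name": "RunPerCollection",
                        "type": "ExecutePipeline",
                        "typeProperties": {
                            "pipeline": {
                                "referenceName": "pipeline27",
                                "type": "PipelineReference"
                            },
                            "waitOnCompletion": true,
                            "parameters": {
                                "collectionName": {
                                    "value": "@item()",
                                    "type": "Expression"
                                }
                            }
                        }
                    }
                ]
            }
        }
    ],
    "parameters": {
        "collectionNames": {
            "type": "Array",
            "defaultValue": [
                "directory-clusters"
            ]
        }
    }
}
}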

I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamps to query the Cosmos data with, then for each of those, getting the document IDs in that range, and then for each of those, getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here are the pipelines I created:

LookupTimestamps - loops through each timestamp range from a times.json file, and for each timestamp, executes the ExportFromCosmos pipeline

{
    "name": "LookupTimestamps",
    "properties": {
        "activities": [
            {
                "name": "LookupTimestamps",
                "type": "Lookup",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "recursive": false
                    },
                    "dataset": {
                        "referenceName": "BlobStorageTimestamps",
                        "type": "DatasetReference"
                    },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachTimestamp",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "LookupTimestamps",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "typeProperties": {
                    "items": {
                        "value": "@activity('LookupTimestamps').output.value",
                        "type": "Expression"
                    },
                    "isSequential": false,
                    "activities": [
                        {
                            "name": "Execute Pipeline1",
                            "type": "ExecutePipeline",
                            "typeProperties": {
                                "pipeline": {
                                    "referenceName": "ExportFromCosmos",
                                    "type": "PipelineReference"
                                },
                                "waitOnCompletion": true,
                                "parameters": {
                                    "From": {
                                        "value": "@{item().From}",
                                        "type": "Expression"
                                    },
                                    "To": {
                                        "value": "@{item().To}",
                                        "type": "Expression"
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact that you can't have nested ForEach activities.

{
    "name": "ExportFromCosmos",
    "properties": {
        "activities": [
            {
                "name": "LookupDocuments",
                "type": "Lookup",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "DocumentDbCollectionSource",
                        "query": {
                            "value": "select c.id, c.partitionKey from c where c._ts >= @{pipeline().parameters.from} and c._ts <= @{pipeline().parameters.to} order by c._ts desc",
                            "type": "Expression"
                        },
                        "nestingSeparator": "."
                    },
                    "dataset": {
                        "referenceName": "CosmosDb",
                        "type": "DatasetReference"
                    },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachDocument",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "LookupDocuments",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "typeProperties": {
                    "items": {
                        "value": "@activity('LookupDocuments').output.value",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "Copy1",
                            "type": "Copy",
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "typeProperties": {
                                "source": {
                                    "type": "DocumentDbCollectionSource",
                                    "query": {
                                        "value": "select * from c where c.id = \"@{item().id}\" and c.partitionKey = \"@{item().partitionKey}\"",
                                        "type": "Expression"
                                    },
                                    "nestingSeparator": "."
                                },
                                "sink": {
                                    "type": "BlobSink"
                                },
                                "enableStaging": false
                            },
                            "inputs": [
                                {
                                    "referenceName": "CosmosDb",
                                    "type": "DatasetReference"
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "BlobStorageDocuments",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "id": {
                                            "value": "@item().id",
                                            "type": "Expression"
                                        },
                                        "partitionKey": {
                                            "value": "@item().partitionKey",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "parameters": {
            "from": {
                "type": "int"
            },
            "to": {
                "type": "int"
            }
        }
    }
}

BlobStorageTimestamps - dataset for the times.json file

{
    "name": "BlobStorageTimestamps",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects"
            },
            "fileName": "times.json",
            "folderPath": "mycollection"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

BlobStorageDocuments - dataset for where the documents will be saved

{
    "name": "BlobStorageDocuments",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "id": {
                "type": "string"
            },
            "partitionKey": {
                "type": "string"
            }
        },
        "type": "AzureBlob",
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects"
            },
            "fileName": {
                "value": "@{dataset().partitionKey}/@{dataset().id}.json",
                "type": "Expression"
            },
            "folderPath": "mycollection"
        }
    },
    "type": "Microsoft.DataFactory/factories/datasets"
}

The times.json file is just a list of epoch time ranges (the example below covers two consecutive UTC days) and looks like this:

[{
    "From": 1556150400,
    "To": 1556236799
},
{
    "From": 1556236800,
    "To": 1556323199
}]
