How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory
I'm trying to back up my Cosmos DB storage using Azure Data Factory (v2). In general it does its job, but I want each doc in a Cosmos collection to correspond to a new JSON file in Blob storage.
With the following copy parameters I'm able to copy all docs in a collection into one file in Azure Blob storage:
{
"name": "ForEach_mih",
"type": "ForEach",
"typeProperties": {
"items": {
"value": "@pipeline().parameters.cw_items",
"type": "Expression"
},
"activities": [
{
"name": "Copy_mih",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"userProperties": [
{
"name": "Source",
"value": "@{item().source.collectionName}"
},
{
"name": "Destination",
"value": "cosmos-backup-v2/@{item().destination.fileName}"
}
],
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"path": "cosmos-backup-logs"
},
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "SourceDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_collectionName": "@item().source.collectionName"
}
}
],
"outputs": [
{
"referenceName": "DestinationDataset_mih",
"type": "DatasetReference",
"parameters": {
"cw_fileName": "@item().destination.fileName"
}
}
]
}
]
}
}
How can I copy each Cosmos doc to a separate file and name it {PartitionId}-{docId}?
UPD
Source dataset code:
{
"name": "ClustersData",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_CosmosDb",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "directory-clusters"
}
}
}
Destination dataset code:
{
"name": "OutputClusters",
"properties": {
"linkedServiceName": {
"referenceName": "Clear_Test_BlobStorage",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "",
"folderPath": "cosmos-backup-logs"
}
}
}
Pipeline code:
{
"name": "copy-clsts",
"properties": {
"activities": [
{
"name": "LookupClst",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"nestingSeparator": "."
},
"dataset": {
"referenceName": "ClustersData",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachClst",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupClst",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('LookupClst').output.value",
"type": "Expression"
},
"batchCount": 8,
"activities": [
{
"name": "CpyClst",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": "select @{item()}",
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"enableSkipIncompatibleRow": true,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "ClustersData",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputClusters",
"type": "DatasetReference"
}
]
}
]
}
}
]
}
}
Example of a doc in the input collection (all docs have the same format):
{
"$type": "Entities.ADCluster",
"DisplayName": "TESTNetBIOS",
"OrgId": "9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"ClusterId": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"AllowLdapLifeCycleSynchronization": true,
"DirectoryServers": [
{
"$type": "Entities.DirectoryServer",
"AddressId": "e6a8edbb-ad56-4135-94af-fab50b774256",
"Port": 389,
"Host": "192.168.342.234"
}
],
"DomainNames": [
"TESTNetBIOS"
],
"BaseDn": null,
"UseSsl": false,
"RepositoryType": 1,
"DirectoryCustomizations": null,
"_etag": "\"140046f2-0000-0000-0000-5ac63a180000\"",
"LastUpdateTime": "2018-04-05T15:00:40.243Z",
"id": "ab2a242d-f1a5-62ed-b420-31b52e958586",
"PartitionKey": "directory-clusters-9b679d2a-42c5-4c9a-a2e2-3ce63c1c3506",
"_rid": "kpvxLAs6gkmsCQAAAAAAAA==",
"_self": "dbs/kvpxAA==/colls/kpvxLAs6gkk=/docs/kvpxALs6kgmsCQAAAAAAAA==/",
"_attachments": "attachments/",
"_ts": 1522940440
}
Since your Cosmos DB documents contain arrays, and ADF doesn't support serializing arrays for Cosmos DB, this is the workaround I can provide.
First, export all your documents to JSON files using "export JSON as-is" (to Blob, ADLS, or a file system; any file storage works). I think you already know how to do that. This way, each collection ends up as one JSON file.
Second, process each JSON file to extract every row in the file into its own file.
I only provide the pipeline for step 2. You could use an Execute Pipeline activity to chain step 1 and step 2, and you could even handle all the collections in step 2 with a ForEach activity.
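To make step 2 concrete, here is a minimal Python sketch of the same split done outside ADF. It assumes the step 1 export is a single JSON array and that every document carries `PartitionKey` and `id` fields, as in the sample document from the question. `split_export` and its arguments are names made up for this illustration, not part of ADF.

```python
import json
from pathlib import Path


def split_export(export_path, out_dir):
    """Write each document of a JSON-array export to its own file.

    Files are named {PartitionKey}-{id}.json, the naming asked for in the
    question. Returns the list of file names written, in input order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    with open(export_path, encoding="utf-8") as f:
        docs = json.load(f)  # assumption: the export is one JSON array

    written = []
    for doc in docs:
        # Assumption: neither field contains characters invalid in a file name.
        name = f"{doc['PartitionKey']}-{doc['id']}.json"
        (out / name).write_text(json.dumps(doc, indent=2), encoding="utf-8")
        written.append(name)
    return written
```

Nested arrays survive untouched here because each document is written back exactly as it was parsed, which is the same reason the ADF workaround goes through an as-is export first.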
Pipeline JSON:
{
"name": "pipeline27",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"dataset": {
"referenceName": "AzureBlob7",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('Lookup1').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select @{item()}",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false,
"cloudDataMovementUnits": 0
},
"inputs": [
{
"referenceName": "DocumentDbCollection1",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureBlob6",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "@item().id",
"type": "Expression"
},
"PartitionKey": {
"value": "@item().PartitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Dataset JSON for the lookup:
{
"name": "AzureBlob7",
"properties": {
"linkedServiceName": {
"referenceName": "bloblinkedservice",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "cosmos.json",
"folderPath": "aaa"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Source dataset for the copy. This dataset is not really used for its data; it only serves to host the query (select @{item()}).
{
"name": "DocumentDbCollection1",
"properties": {
"linkedServiceName": {
"referenceName": "CosmosDB-r8c",
"type": "LinkedServiceReference"
},
"type": "DocumentDbCollection",
"typeProperties": {
"collectionName": "test"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Destination dataset. Its two parameters also take care of your file name request.
{
"name": "AzureBlob6",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorage-eastus",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "String"
},
"PartitionKey": {
"type": "String"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects"
},
"fileName": {
"value": "@{dataset().PartitionKey}-@{dataset().id}.json",
"type": "Expression"
},
"folderPath": "aaacosmos"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
Please also note the limitations of the Lookup activity: only certain data sources are supported for lookup, it can return at most 5000 rows and up to 2 MB of data, and the maximum duration before it times out is currently one hour.
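As a rough way to tell in advance whether a collection export will fit through a single Lookup, something like the following Python check could be run against the exported rows. It is only an approximation: it assumes "2 MB" means 2 MiB of serialized JSON, and how ADF actually measures the result size is not something I can confirm. The constant and function names are invented for this sketch.

```python
import json

MAX_LOOKUP_ROWS = 5000
MAX_LOOKUP_BYTES = 2 * 1024 * 1024  # assumption: 2 MB taken as 2 MiB of JSON


def fits_lookup_limits(rows):
    """Return True if `rows` would plausibly fit in one Lookup result."""
    if len(rows) > MAX_LOOKUP_ROWS:
        return False
    # Approximate the payload size by the UTF-8 length of the JSON encoding.
    size = len(json.dumps(rows).encode("utf-8"))
    return size <= MAX_LOOKUP_BYTES
```

If the check fails, the input has to be split into smaller pieces before the Lookup, for example one file per time range.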
Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.
You could consider having an Azure Function that is triggered when documents are added or updated in your collection, and have the Azure Function output each document to Blob storage. This should scale well and would be relatively easy to implement.
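The core of such a function could look roughly like the Python sketch below. It deliberately leaves out the trigger and storage bindings: `documents` stands for the batch of changed documents the trigger would hand over, and `upload(name, data)` is a hypothetical callable that writes one blob. The `PartitionKey` and `id` field names are taken from the sample document in the question.

```python
import json


def export_changed_documents(documents, upload):
    """Upload each changed document as its own JSON blob.

    `documents` is an iterable of dicts (one per changed Cosmos document).
    `upload(name, data)` is a hypothetical callable that stores `data`
    (bytes) under the blob name `name`. Returns the blob names written.
    """
    names = []
    for doc in documents:
        name = f"{doc['PartitionKey']}-{doc['id']}.json"
        upload(name, json.dumps(doc).encode("utf-8"))
        names.append(name)
    return names
```

Because the blob name is derived from the document identity, an updated document simply overwrites its previous copy, which keeps the backup at one file per document.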
Just take one collection as an example.
And inside the ForEach:
Your Lookup activity and your Copy activity source both reference the same Cosmos DB dataset.
If you want to copy your 5 collections, you could put this pipeline into an Execute Pipeline activity, and give the master pipeline that holds the Execute Pipeline activity a ForEach activity.
I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamp ranges to query the Cosmos data with, then for each range getting the document IDs in that range, and then for each of those getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here are the pipelines I created:
LookupTimestamps - loops through each timestamp range from a times.json file, and for each range executes the ExportFromCosmos pipeline
{
"name": "LookupTimestamps",
"properties": {
"activities": [
{
"name": "LookupTimestamps",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": false
},
"dataset": {
"referenceName": "BlobStorageTimestamps",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachTimestamp",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupTimestamps",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('LookupTimestamps').output.value",
"type": "Expression"
},
"isSequential": false,
"activities": [
{
"name": "Execute Pipeline1",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "ExportFromCosmos",
"type": "PipelineReference"
},
"waitOnCompletion": true,
"parameters": {
"From": {
"value": "@{item().From}",
"type": "Expression"
},
"To": {
"value": "@{item().To}",
"type": "Expression"
}
}
}
}
]
}
}
]
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact that you can't have nested ForEach activities.
{
"name": "ExportFromCosmos",
"properties": {
"activities": [
{
"name": "LookupDocuments",
"type": "Lookup",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select c.id, c.partitionKey from c where c._ts >= @{pipeline().parameters.from} and c._ts <= @{pipeline().parameters.to} order by c._ts desc",
"type": "Expression"
},
"nestingSeparator": "."
},
"dataset": {
"referenceName": "CosmosDb",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "ForEachDocument",
"type": "ForEach",
"dependsOn": [
{
"activity": "LookupDocuments",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "@activity('LookupDocuments').output.value",
"type": "Expression"
},
"activities": [
{
"name": "Copy1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DocumentDbCollectionSource",
"query": {
"value": "select * from c where c.id = \"@{item().id}\" and c.partitionKey = \"@{item().partitionKey}\"",
"type": "Expression"
},
"nestingSeparator": "."
},
"sink": {
"type": "BlobSink"
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "CosmosDb",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobStorageDocuments",
"type": "DatasetReference",
"parameters": {
"id": {
"value": "@item().id",
"type": "Expression"
},
"partitionKey": {
"value": "@item().partitionKey",
"type": "Expression"
}
}
}
]
}
]
}
}
],
"parameters": {
"from": {
"type": "int"
},
"to": {
"type": "int"
}
}
}
}
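To see what the two query expressions in ExportFromCosmos expand to at run time, here is a small Python sketch that renders them the same way the `@{...}` interpolation does. The function names are made up for the illustration; the query text is copied from the pipeline above.

```python
def lookup_query(from_ts, to_ts):
    """Render the LookupDocuments query for one inclusive timestamp range."""
    return (
        "select c.id, c.partitionKey from c "
        f"where c._ts >= {from_ts} and c._ts <= {to_ts} order by c._ts desc"
    )


def document_query(item):
    """Render the Copy1 source query for one row returned by the lookup."""
    # Plain string interpolation, as in the pipeline: an id or partition key
    # containing a double quote would break the resulting query.
    return (
        f'select * from c where c.id = "{item["id"]}" '
        f'and c.partitionKey = "{item["partitionKey"]}"'
    )
```

Each range therefore costs one lookup query plus one single-document query per returned row, so the ranges need to be narrow enough that a lookup stays within the Lookup activity's limits.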
BlobStorageTimestamps - dataset for the times.json file
{
"name": "BlobStorageTimestamps",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": "times.json",
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
BlobStorageDocuments - dataset for where the documents will be saved
{
"name": "BlobStorageDocuments",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage1",
"type": "LinkedServiceReference"
},
"parameters": {
"id": {
"type": "string"
},
"partitionKey": {
"type": "string"
}
},
"type": "AzureBlob",
"typeProperties": {
"format": {
"type": "JsonFormat",
"filePattern": "arrayOfObjects"
},
"fileName": {
"value": "@{dataset().partitionKey}/@{dataset().id}.json",
"type": "Expression"
},
"folderPath": "mycollection"
}
},
"type": "Microsoft.DataFactory/factories/datasets"
}
The times.json file is just a list of epoch time ranges and looks like this:
[{
"From": 1556150400,
"To": 1556236799
},
{
"From": 1556236800,
"To": 1556323199
}]
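A file like this does not have to be written by hand. The following Python sketch generates consecutive one-day ranges in the same shape; `daily_ranges` is a name invented here, and the one-day width is only an example, since the right width depends on how many documents fall into each range.

```python
import json

SECONDS_PER_DAY = 86400


def daily_ranges(start_epoch, days):
    """Return `days` consecutive one-day ranges starting at `start_epoch`.

    Both bounds are inclusive, matching the `>=` / `<=` filter on `_ts`,
    so adjacent ranges never overlap.
    """
    ranges = []
    for i in range(days):
        start = start_epoch + i * SECONDS_PER_DAY
        ranges.append({"From": start, "To": start + SECONDS_PER_DAY - 1})
    return ranges


if __name__ == "__main__":
    # Prints a JSON array that can be saved as times.json.
    print(json.dumps(daily_ranges(1556150400, 2), indent=2))
```

Called with 1556150400 and 2, it reproduces the two ranges shown above.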