Azure Data Factory Copy

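I have an Azure Data Factory copy activity in a pipeline. The copy activity works, but the data is being copied multiple times. My data source is an Azure NoSQL DB. How do I configure the copy activity so that it does not re-copy records?

Here is my activity: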

{
  "name": "Copy Usage Session Data",
  "properties": 
  {
    "description": "",
    "activities": 
    [
      {
        "type": "Copy",
        "typeProperties": 
        {
          "source": {"type": "DocumentDbCollectionSource"},
          "sink": 
          {
            "type": "SqlSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "05:00:00",
            "sliceIdentifierColumnName": "InstallationSliceIdentifier"
          },
          "translator": 
          {
            "type": "TabularTranslator",
            "ColumnMappings": "machineKey: machineKey, product: product, softwareVersion: softwareVersion, id: DocumentDBId"
          }

        },
        "inputs": [{"name": "Machine Registration Input Data"}],
        "outputs": [{"name": "Machine Registration Output Data"}],
        "policy": 
        {
          "timeout": "01:00:00",
          "concurrency": 1,
          "executionPriorityOrder": "OldestFirst"
        },
        "scheduler": 
        {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "Machine Registration Data To History",
        "description": "Copy Machine Registration Data To SQL Server DB Activity"
      },
      {
        "type": "Copy",
        "typeProperties": 
        {
          "source": {"type": "DocumentDbCollectionSource"},
          "sink": 
          {
            "type": "SqlSink",
            "writeBatchSize": 0,
            "writeBatchTimeout": "05:00:00",
            "sliceIdentifierColumnName": "UsageSessionSliceIdentifier"
          },
          "translator": 
          {
            "type": "TabularTranslator",
            "ColumnMappings": "id: usageSessionId, usageInstallationId: usageInstallationId, startTime: startTime, stopTime: stopTime, currentVersion: currentVersion"
          }
        },
        "inputs": [{"name": "Usage Session Input Data"}],
        "outputs": [{"name": "Usage Session Output Data"}],
        "policy": 
        {
          "timeout": "01:00:00",
          "concurrency": 2,
          "executionPriorityOrder": "OldestFirst"
        },
        "scheduler": 
        {
          "frequency": "Hour",
          "interval": 1
        },
        "name": "Usage Session Data To History",
        "description": "Copy Usage Session Data To SQL Server DB Activity"
      }
    ],
    "start": "2017-05-29T16:15:00Z",
    "end": "2500-01-01T00:00:00Z",
    "isPaused": false,        
    "pipelineMode": "Scheduled"
  }
}

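You can use a query that filters on a created/modified date (this field should exist in the table) and selects only the records for the current date. The date is supplied by the slice start or end date, so that each day you read only the newly created records.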

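Change the pipeline start date to the current date. If the pipeline start date is in the past, data slices are created for every interval from that date up to the current date, and each of them is copied. Also, you have set concurrency to 2 (on the "Usage Session Data To History" activity), which means 2 activity runs execute at the same time.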

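For example, if your output dataset availability is 1 day and the pipeline start date is 29-05-2017, then up to today, 16-06-2017, a total of 18 daily data slices are created. If concurrency is set to 2, 2 copy activities run at a time. If concurrency is 10, 10 copy activities run in parallel.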
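The slice arithmetic above can be checked with a short Python sketch. It is only an illustration, not Data Factory code: the function names `daily_slices` and `batches` are hypothetical, and it assumes slices start at midnight and that only fully elapsed windows count.

```python
from datetime import datetime, timedelta


def daily_slices(start, now, interval=timedelta(days=1)):
    """List the (window_start, window_end) pairs that have fully elapsed by `now`."""
    slices = []
    window_start = start
    while window_start + interval <= now:
        slices.append((window_start, window_start + interval))
        window_start += interval
    return slices


def batches(slice_count, concurrency):
    """Number of rounds needed if `concurrency` slices are processed at a time."""
    return -(-slice_count // concurrency)  # ceiling division


# Assumption: pipeline start 2017-05-29 00:00, "today" 2017-06-16 00:00.
slices = daily_slices(datetime(2017, 5, 29), datetime(2017, 6, 16))
print(len(slices))                # 18 backlog slices
print(batches(len(slices), 2))    # 9 rounds with concurrency 2
print(batches(len(slices), 10))   # 2 rounds with concurrency 10
```

If the source has no window filter, each of those 18 slices copies the whole collection, which matches the duplication described in the question.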

Pay attention to the output dataset availability, the pipeline start date, the concurrency, and the source query.

An example source query is $$Text.Format('select * from c where c.ModifiedDate >= \\'{0:yyyy-MM-ddTHH:mm:ssZ}\\' AND c.ModifiedDate < \\'{1:yyyy-MM-ddTHH:mm:ssZ}\\'', WindowStart, WindowEnd), where ModifiedDate is a field that indicates when the document was created in that particular collection.
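To see why a windowed query avoids re-copying, here is a rough Python sketch of what the query looks like once the slice window is substituted in. It only imitates the substitution for illustration and is not how Data Factory evaluates `$$Text.Format`; `render_query` and `in_window` are hypothetical helper names.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def render_query(window_start, window_end):
    """Build the query text for one slice, with both window bounds filled in."""
    return (
        "select * from c where c.ModifiedDate >= '{0}' AND c.ModifiedDate < '{1}'"
    ).format(
        window_start.strftime(TIMESTAMP_FORMAT),
        window_end.strftime(TIMESTAMP_FORMAT),
    )


def in_window(modified, window_start, window_end):
    """Same half-open test as the query: start inclusive, end exclusive."""
    return window_start <= modified < window_end


# Two consecutive hourly slices (the question's scheduler runs every hour).
first_start = datetime(2017, 6, 16, 10)
second_start = first_start + timedelta(hours=1)
second_end = second_start + timedelta(hours=1)

print(render_query(first_start, second_start))

# A document modified exactly on the boundary belongs to one slice only.
boundary_doc = second_start
hits = [
    in_window(boundary_doc, first_start, second_start),
    in_window(boundary_doc, second_start, second_end),
]
print(hits)
```

Because the lower bound is inclusive and the upper bound is exclusive, consecutive slices never overlap, so each document is selected by exactly one slice.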

Update:

{
  "name": "DocDbToBlobPipeline",
  "properties": {
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "DocumentDbCollectionSource",
            "query": "SELECT Person.Id, Person.Name.First AS FirstName, Person.Name.Middle as MiddleName, Person.Name.Last AS LastName FROM Person",
            "nestingSeparator": "."
          },
          "sink": {
            "type": "BlobSink",
            "blobWriterAddHeader": true,
            "writeBatchSize": 1000,
            "writeBatchTimeout": "00:00:59"
          }
        },
        "inputs": [
          {
            "name": "PersonDocumentDbTable"
          }
        ],
        "outputs": [
          {
            "name": "PersonBlobTableOut"
          }
        ],
        "policy": {
          "concurrency": 1
        },
        "name": "CopyFromDocDbToBlob"
      }
    ],
    "start": "2015-04-01T00:00:00Z",
    "end": "2015-04-02T00:00:00Z"
  }
} 

For reference, take a look at Data Factory scheduling and execution.
