
Azure Data Factory - Bulk Import from Blob to Azure SQL

I have a simple file, FD_GROUP.TXT, with the following content:

~0100~^~Dairy and Egg Products~
~0200~^~Spices and Herbs~
~0300~^~Baby Foods~
~0400~^~Fats and Oils~
~0500~^~Poultry Products~

I am trying to bulk import files like this (some with 700,000 rows) into an Azure SQL database with Azure Data Factory.

My strategy is to first split the columns on ^, then replace the tildes (~) with empty strings so they are stripped out, and then perform the insert.

1. SQL solution:

DECLARE @CsvFilePath NVARCHAR(1000) = 'D:\CodePurehope\Dev\NutrientData\FD_GROUP.txt';

-- Staging table for the raw, tilde-wrapped values
CREATE TABLE #TempTable
 (
    [FoodGroupCode] VARCHAR(666) NOT NULL, 
    [FoodGroupDescription] VARCHAR(60) NOT NULL
 )

-- BULK INSERT does not accept a variable as the file path, so the statement is built dynamically
DECLARE @sql NVARCHAR(4000) = 'BULK INSERT #TempTable FROM ''' + @CsvFilePath + ''' WITH ( FIELDTERMINATOR =''^'', ROWTERMINATOR =''\n'' )';
EXEC(@sql);

-- Strip the tilde wrappers from both columns
UPDATE #TempTable
   SET [FoodGroupCode] = REPLACE([FoodGroupCode], '~', ''),
       [FoodGroupDescription] = REPLACE([FoodGroupDescription], '~', '')
GO

-- Copy the cleaned rows into the destination table
INSERT INTO [dbo].[FoodGroupDescriptions]
(
    [FoodGroupCode],
    [FoodGroupDescription]
)
SELECT
    [FoodGroupCode],
    [FoodGroupDescription]
FROM
    #TempTable
GO

DROP TABLE #TempTable

2. SSIS ETL package solution: [screenshot of the SSIS data flow]

A flat file source splits the columns on ^, and a derived column transformation replaces the unnecessary tildes (~), as seen in the screenshot above; a sketch of those expressions follows.
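
Since the screenshot may not be visible, the derived column expressions would look roughly like this (SSIS expression syntax; column names assumed to match the flat file source):

REPLACE(FoodGroupCode, "~", "")
REPLACE(FoodGroupDescription, "~", "")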

How do you do this with Microsoft Azure Data Factory?
I have FD_GROUP.TXT uploaded to Azure Blob Storage as input, and a table ready on Azure SQL Database for output.
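
For reference, the output table would be shaped roughly like this (a minimal sketch inferred from the dataset structures below; the actual DDL isn't shown here):

CREATE TABLE [dbo].[FoodGroupDescriptions]
(
    [FoodGroupCode] INT NOT NULL,                 -- Int32 in the datasets
    [FoodGroupDescription] VARCHAR(60) NOT NULL   -- String in the datasets
);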

I have:
- 2 linked services: AzureStorage and AzureSQL (sketched below).
- 2 datasets: Blob as input and SQL as output.
- 1 pipeline.
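
The two linked service definitions aren't shown here; in ADF (v1) JSON they would look roughly like this, with placeholder connection strings:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        }
    }
}

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>@<server>;Password=<password>;Encrypt=True"
        }
    }
}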


FoodGroupDescriptionsAzureBlob settings

{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "structure": [
            {
                "name": "FoodGroupCode",
                "type": "Int32"
            },
            {
                "name": "FoodGroupDescription",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "FD_GROUP.txt",
            "folderPath": "nutrition-data/NutrientData/",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "^"
            }
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}

FoodGroupDescriptionsSQLAzure settings

{
    "name": "FoodGroupDescriptionsSQLAzure",
    "properties": {
        "structure": [
            {
                "name": "FoodGroupCode",
                "type": "Int32"
            },
            {
                "name": "FoodGroupDescription",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "FoodGroupDescriptions"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}

FoodGroupDescriptionsPipeline settings

{
    "name": "FoodGroupDescriptionsPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "FoodGroupDescriptionsAzureBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "FoodGroupDescriptionsSQLAzure"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "CopyFromBlobToSQL",
                "description": "Bulk Import FoodGroupDescriptions"
            }
        ],
        "start": "2015-07-13T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": false,
        "hubName": "gymappdatafactory_hub",
        "pipelineMode": "Scheduled"
    }
}

This doesn't work in Azure Data Factory, and I have no clue how to do the replace in this context. Any help is appreciated.

I took your code and was able to get it working by doing the following:

In your FoodGroupDescriptionsAzureBlob JSON definition, you need to add "external": true in the properties node. The blob input file was created by an outside source, not by an Azure Data Factory pipeline; setting this to true lets Azure Data Factory know that this input should be ready for use.

Also, in the blob input definition, add "quoteChar": "~" to the "format" node. Since the data is wrapped in "~", this strips those characters from the values (e.g. ~0100~ is read as 0100), so the Int32 you defined will insert properly into your SQL table.

Full blob def:

{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "structure": [
            {
                "name": "FoodGroupCode",
                "type": "Int32"
            },
            {
                "name": "FoodGroupDescription",
                "type": "String"
            }
        ],
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "FD_GROUP.txt",
            "folderPath": "nutrition-data/NutrientData/",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "^",
                "quoteChar": "~"
            }
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        },
        "external": true,
        "policy": {}
    }
}

Since you have the interval set to every 15 minutes and the pipeline's start and end dates spanning a full day, this will create a slice every 15 minutes for the entire run duration (96 slices for the day). Since you only want to run this once, change the start and end to:

  "start": "2015-07-13T00:00:00Z",
  "end": "2015-07-13T00:15:00Z",

This will create 1 slice.
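
Once that slice has run, a quick sanity check on the sink table (generic T-SQL, just to confirm the tildes were stripped and the codes parsed):

SELECT TOP (10) [FoodGroupCode], [FoodGroupDescription]
FROM [dbo].[FoodGroupDescriptions];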

Hope this helps.
