
Copying a 14 GB file from FTP to Azure Data Lake Store using ADF

I am trying to copy a 14 GB file from FTP to my Azure Data Lake Store using Azure Data Factory. When I executed the pipeline, it started copying the file and copied almost 13.9 GB within half an hour.

The remaining data was not copied even after the pipeline had run for 8 hours, and the run finally failed with a message that the file was not available. The file was not available because the source team had removed it to make room for the next file.

I have already increased the data integration units (DIU) to 250.

{
    "name": "job_fa",
    "properties": {
        "activities": [
            {
                "name": "set_parameters_adh_or_sch",
                "description": "validate and set the parameter values based on the runtype sch or adh",
                "type": "Lookup",
                "dependsOn": [
                    {
                        "activity": "br_bs_loggin",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "CheckLookup1",
                        "value": "1"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderStoredProcedureName": "[dbo].[usp_FeedParameters_main]",
                        "storedProcedureParameters": {
                            "FeedName_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_FeedName",
                                    "type": "Expression"
                                }
                            },
                            "RunType_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_RunType",
                                    "type": "Expression"
                                }
                            },
                            "SrcEnddate_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_SrcEndDate",
                                    "type": "Expression"
                                }
                            },
                            "SrcStartdate_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_SrcStartDate",
                                    "type": "Expression"
                                }
                            },
                            "TgtDate_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_TargetDate",
                                    "type": "Expression"
                                }
                            },
                            "SrcHour_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_SrcHour",
                                    "type": "Expression"
                                }
                            },
                            "TgtHour_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_TgtHour",
                                    "type": "Expression"
                                }
                            }
                        }
                    },
                    "dataset": {
                        "referenceName": "AzureSql_cdpconfiguser",
                        "type": "DatasetReference"
                    },
                    "firstRowOnly": true
                }
            },
            {
                "name": "br_bs_loggin",
                "description": "insert into the batch run and update the batch scheduler to started in case of sch run",
                "type": "Lookup",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "source": {
                        "type": "SqlSource",
                        "sqlReaderStoredProcedureName": "[dbo].[usp_BatchRun]",
                        "storedProcedureParameters": {
                            "FeedName_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_FeedName",
                                    "type": "Expression"
                                }
                            },
                            "RunType_in": {
                                "type": "String",
                                "value": {
                                    "value": "@pipeline().parameters.p_RunType",
                                    "type": "Expression"
                                }
                            },
                            "Status_in": {
                                "type": "String",
                                "value": "Started"
                            }
                        }
                    },
                    "dataset": {
                        "referenceName": "AzureSql_cdpconfiguser",
                        "type": "DatasetReference"
                    },
                    "firstRowOnly": true
                }
            },
            {
                "name": "Check if file exists in target",
                "type": "GetMetadata",
                "dependsOn": [
                    {
                        "activity": "Copy Data WT to ADLS",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "dataset": {
                        "referenceName": "AzureDataLakeStoreFile_wt_tgt_path_and_name",
                        "type": "DatasetReference",
                        "parameters": {
                            "TgtFilePath": "@activity('set_parameters_adh_or_sch').output.firstrow.TgtFileName_wt_dt_out",
                            "TgtFileName": {
                                "value": "@activity('set_parameters_adh_or_sch').output.firstrow.TgtFileName_wt_dt_out",
                                "type": "Expression"
                            }
                        }
                    },
                    "fieldList": [
                        "exists",
                        "size"
                    ]
                }
            },
            {
                "name": "Copy Data WT to ADLS",
                "type": "Copy",
                "dependsOn": [
                    {
                        "activity": "set_parameters_adh_or_sch",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "@{activity('set_parameters_adh_or_sch').output.firstrow.SrcFilePath_wo_dt_out}/@{activity('set_parameters_adh_or_sch').output.firstrow.SrcFileName_wt_dt_out}"
                    },
                    {
                        "name": "Destination",
                        "value": "@{activity('set_parameters_adh_or_sch').output.firstrow.TgtFilePath_wt_dt_out}/@{activity('set_parameters_adh_or_sch').output.firstrow.TgtFilePath_wt_dt_out}"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "FileSystemSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "AzureDataLakeStoreSink"
                    },
                    "enableStaging": false,
                    "dataIntegrationUnits": 0
                },
                "inputs": [
                    {
                        "referenceName": "FTP_SRC_FA",
                        "type": "DatasetReference",
                        "parameters": {
                            "SrcFileName": "@activity('set_parameters_adh_or_sch').output.firstrow.SrcFileName_wt_dt_out",
                            "SrcFilePath": "@activity('set_parameters_adh_or_sch').output.firstrow.SrcFilePath_wo_dt_out"
                        }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "AzureDataLakeStoreFile_wt_tgt_path_and_name",
                        "type": "DatasetReference",
                        "parameters": {
                            "TgtFilePath": "@activity('set_parameters_adh_or_sch').output.firstrow.TgtFileName_wt_dt_out",
                            "TgtFileName": {
                                "value": "@activity('set_parameters_adh_or_sch').output.firstrow.TgtFileName_wt_dt_out",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            },
            {
                "name": "br_bs_update_failed",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {
                        "activity": "Copy Data WT to ADLS",
                        "dependencyConditions": [
                            "Failed"
                        ]
                    }
                ],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "typeProperties": {
                    "storedProcedureName": "[dbo].[usp_BatchRunUpdate]",
                    "storedProcedureParameters": {
                        "BatchId": {
                            "value": {
                                "value": "@activity('br_bs_loggin').output.firstrow.Batchid_out",
                                "type": "Expression"
                            },
                            "type": "String"
                        },
                        "FeedID": {
                            "value": {
                                "value": "@activity('br_bs_loggin').output.firstrow.FeedId_out",
                                "type": "Expression"
                            },
                            "type": "Int32"
                        },
                        "FeedRunId": {
                            "value": {
                                "value": "@activity('br_bs_loggin').output.firstrow.BatchRunId_out",
                                "type": "Expression"
                            },
                            "type": "Int32"
                        },
                        "Status": {
                            "value": "Failed",
                            "type": "String"
                        }
                    }
                },
                "linkedServiceName": {
                    "referenceName": "AzureSqlDatabase1_cdp_dev_sql_db_appconfig",
                    "type": "LinkedServiceReference"
                }
            },
            {
                "name": "If Condition1",
                "type": "IfCondition",
                "dependsOn": [
                    {
                        "activity": "Check if file exists in target",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "typeProperties": {
                    "expression": {
                        "value": "@equals(activity('Check if file exists in target').output.Exists,true)",
                        "type": "Expression"
                    },
                    "ifFalseActivities": [
                        {
                            "name": "Stored Procedure_failed",
                            "type": "SqlServerStoredProcedure",
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "typeProperties": {
                                "storedProcedureName": "[dbo].[usp_BatchRunUpdate]",
                                "storedProcedureParameters": {
                                    "BatchId": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.Batchid_out",
                                            "type": "Expression"
                                        },
                                        "type": "String"
                                    },
                                    "FeedID": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.FeedId_out",
                                            "type": "Expression"
                                        },
                                        "type": "Int32"
                                    },
                                    "FeedRunId": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.BatchRunId_out",
                                            "type": "Expression"
                                        },
                                        "type": "Int32"
                                    },
                                    "Status": {
                                        "value": "Failed",
                                        "type": "String"
                                    }
                                }
                            },
                            "linkedServiceName": {
                                "referenceName": "AzureSqlDatabase1_cdp_dev_sql_db_appconfig",
                                "type": "LinkedServiceReference"
                            }
                        }
                    ],
                    "ifTrueActivities": [
                        {
                            "name": "Stored Procedure1",
                            "type": "SqlServerStoredProcedure",
                            "policy": {
                                "timeout": "7.00:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "typeProperties": {
                                "storedProcedureName": "[dbo].[usp_BatchRunUpdate]",
                                "storedProcedureParameters": {
                                    "BatchId": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.Batchid_out",
                                            "type": "Expression"
                                        },
                                        "type": "String"
                                    },
                                    "FeedID": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.FeedId_out",
                                            "type": "Expression"
                                        },
                                        "type": "Int32"
                                    },
                                    "FeedRunId": {
                                        "value": {
                                            "value": "@activity('br_bs_loggin').output.firstrow.BatchRunId_out",
                                            "type": "Expression"
                                        },
                                        "type": "Int32"
                                    },
                                    "Status": {
                                        "value": "Succeeded",
                                        "type": "String"
                                    }
                                }
                            },
                            "linkedServiceName": {
                                "referenceName": "AzureSqlDatabase1_cdp_dev_sql_db_appconfig",
                                "type": "LinkedServiceReference"
                            }
                        }
                    ]
                }
            }
        ],
        "parameters": {
            "p_FeedName": {
                "type": "String",
                "defaultValue": "fa_cpsmyid_vdumcap1"
            },
            "p_BatchType": {
                "type": "String",
                "defaultValue": "RAW"
            },
            "p_RunType": {
                "type": "String",
                "defaultValue": "sch"
            },
            "p_SrcStartDate": {
                "type": "String"
            },
            "p_SrcEndDate": {
                "type": "String"
            },
            "p_TargetDate": {
                "type": "String"
            },
            "p_SrcHour": {
                "type": "String"
            },
            "p_TgtHour": {
                "type": "String"
            }
        },
        "variables": {
            "v_StartDate": {
                "type": "String"
            },
            "v_EndDate": {
                "type": "String"
            }
        },
        "folder": {
            "name": "Batch_load"
        }
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

Based on your description, I think your main concern is improving the transfer performance.

First, according to the Data Integration Units documentation, DIU applies only to the Azure Integration Runtime, not to the Self-hosted Integration Runtime. Your source data comes from FTP, so, assuming it is read through a Self-hosted Integration Runtime, I think the copy is not affected by DIU even though you have already set a very large value. (This is based on the official documentation; you could still ask the ADF team to verify it.)

Beyond that, the documentation on Copy activity performance may give you some clues for improving the copy performance.

For example:

1. Try using the parallelCopies property to indicate the parallelism that you want the Copy activity to use. Note that the documentation describes some restrictions on it.

2. Try setting the sink dataset to Azure SQL Data Warehouse, because it seems to have better performance than ADL.

3. Try compressing the file at the source to reduce the file size.

4. Consider using an Azure cloud storage service, such as Azure Blob Storage, as the source dataset. As far as I know, copy performance between Azure services is generally better.
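
As a rough illustration of point 1, here is a minimal sketch of where parallelCopies would sit in the "Copy Data WT to ADLS" activity from the question. It is only a fragment: dependsOn, policy, inputs and outputs are omitted and would stay as in the original pipeline. The value 8 is an assumed placeholder, not a tuned or tested recommendation, and the description text is mine. As I understand the documentation, for file-based sources the parallelism applies at the file level, so a single large file may see little benefit from this setting.

```json
{
    "name": "Copy Data WT to ADLS",
    "description": "Illustrative sketch only: the parallelCopies value is an assumed placeholder, not a tuned setting",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "FileSystemSource",
            "recursive": true
        },
        "sink": {
            "type": "AzureDataLakeStoreSink"
        },
        "enableStaging": false,
        "parallelCopies": 8,
        "dataIntegrationUnits": 0
    }
}
```

If you try this, change one setting at a time and compare the throughput reported in the activity run details, so you can tell whether the change actually helped.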
