
Bulk upload to Azure Data Lake Gen 2 with REST APIs

In another related question I asked how to upload files from on-premises to Microsoft Azure Data Lake Gen 2, and an answer was provided using the REST APIs. For the sake of completeness, the proposed code can be found below.

Since uploading files sequentially in this way has proven to be relatively slow for large numbers of small files (about 0.05 MB each), I would like to ask whether it is possible to perform a bulk upload of all of them at once, assuming all file paths are known beforehand?

The code for uploading single files to ADLS Gen 2 using the REST APIs:

import requests
import json

def auth(tenant_id, client_id, client_secret):
    # Acquire an OAuth 2.0 access token for Azure Storage via the client-credentials flow.
    print('auth')
    auth_headers = {
        "Content-Type": "application/x-www-form-urlencoded"
    }
    auth_body = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope" : "https://storage.azure.com/.default",
        "grant_type" : "client_credentials"
    }
    resp = requests.post(f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token", headers=auth_headers, data=auth_body)
    return (resp.status_code, json.loads(resp.text))

def mkfs(account_name, fs_name, access_token):
    # Create a filesystem (container) in the ADLS Gen2 account.
    print('mkfs')
    fs_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}?resource=filesystem", headers=fs_headers)
    return (resp.status_code, resp.text)

def mkdir(account_name, fs_name, dir_name, access_token):
    # Create a directory inside the filesystem.
    print('mkdir')
    dir_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}?resource=directory", headers=dir_headers)
    return (resp.status_code, resp.text)

def touch_file(account_name, fs_name, dir_name, file_name, access_token):
    # Create an empty file at the given path (resource=file).
    print('touch_file')
    touch_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}/{file_name}?resource=file", headers=touch_file_headers)
    return (resp.status_code, resp.text)

def append_file(account_name, fs_name, path, content, position, access_token):
    # Append content to the file at the given byte position (action=append).
    print('append_file')
    append_file_headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "text/plain",
        "Content-Length": f"{len(content)}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=append&position={position}", headers=append_file_headers, data=content)
    return (resp.status_code, resp.text)

def flush_file(account_name, fs_name, path, position, access_token):
    # Flush (commit) the appended data up to the given position (action=flush).
    print('flush_file')
    flush_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=flush&position={position}", headers=flush_file_headers)
    return (resp.status_code, resp.text)

def mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token):
    # Upload a local file: create the remote file, append its full content, then flush.
    print('mkfile')
    status_code, result = touch_file(account_name, fs_name, dir_name, file_name, access_token)
    if status_code == 201:
        with open(local_file_name, 'rb') as local_file:
            path = f"{dir_name}/{file_name}"
            content = local_file.read()
            position = 0
            append_file(account_name, fs_name, path, content, position, access_token)
            position = len(content)
            flush_file(account_name, fs_name, path, position, access_token)
    else:
        print(result)


if __name__ == '__main__':
    tenant_id = '<your tenant id>'
    client_id = '<your client id>'
    client_secret = '<your client secret>'

    account_name = '<your adls account name>'
    fs_name = '<your filesystem name>'
    dir_name = '<your directory name>'
    file_name = '<your file name>'
    local_file_name = '<your local file name>'

    # Acquire an Access token
    auth_status_code, auth_result = auth(tenant_id, client_id, client_secret)
    access_token = auth_result['access_token'] if auth_status_code == 200 else ''
    print(access_token)

    # Create a filesystem
    mkfs_status_code, mkfs_result = mkfs(account_name, fs_name, access_token)
    print(mkfs_status_code, mkfs_result)

    # Create a directory
    mkdir_status_code, mkdir_result = mkdir(account_name, fs_name, dir_name, access_token)
    print(mkdir_status_code, mkdir_result)

    # Create a file from local file
    mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)
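For reference, a minimal sketch of how the helpers above could be driven concurrently to reduce the per-file latency; file_pairs is a hypothetical list of (local path, remote file name) tuples, and the account, filesystem, directory and access_token variables from the __main__ block are assumed to already be defined:

from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical input: pairs of (local file path, remote file name).
file_pairs = [("F:\\temp\\a.csv", "a.csv"), ("F:\\temp\\b.csv", "b.csv")]

def upload_all(file_pairs, max_workers=8):
    # Submit one mkfile() call per file and run them on a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(mkfile, account_name, fs_name, dir_name, remote_name,
                        local_name, access_token): remote_name
            for local_name, remote_name in file_pairs
        }
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker
            print(f"uploaded {futures[future]}")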

As of now, the fastest way to upload a large number of files to ADLS Gen2 is to use AzCopy. You can write Python code to call AzCopy.

First, download AzCopy.exe as per this link. After downloading, unzip the file and copy azcopy.exe to a folder (no installation needed, it's an executable file), for example F:\azcopy\v10\azcopy.exe.

Then generate a SAS token from the Azure portal, and copy and save the SAS token:

(Screenshot: generating a SAS token in the Azure portal.)

This assumes you have already created a filesystem for your ADLS Gen2 account; you do not need to create the directory manually, as AzCopy will create it automatically.

Another thing to note: for the endpoint, you should change dfs to blob, e.g. change https://youraccount.dfs.core.windows.net/ to https://youraccount.blob.core.windows.net/.
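For instance, a minimal sketch of that substitution (youraccount is just a placeholder):

# Derive the blob endpoint from the ADLS Gen2 (dfs) endpoint; "youraccount" is a placeholder.
dfs_endpoint = "https://youraccount.dfs.core.windows.net/"
blob_endpoint = dfs_endpoint.replace(".dfs.core.windows.net", ".blob.core.windows.net")
print(blob_endpoint)  # https://youraccount.blob.core.windows.net/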

The sample code is as follows:

import subprocess

exepath = "F:\\azcopy\\v10\\azcopy.exe"
local_directory="F:\\temp\\1\\*"
sasToken="?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-20T09:44:22Z&st=2019-09-20T01:44:22Zxxxxxxxx"

#note for the endpoint, you should change dfs to blob
endpoint="https://yygen2.blob.core.windows.net/w22/testfile5/"
myscript=exepath + " copy " + "\""+ local_directory + "\" " + "\""+endpoint+sasToken + "\"" + " --recursive"

print(myscript)
subprocess.call(myscript)

print("completed")

The test result is as below; all the files and sub-folders in the local directory are uploaded to ADLS Gen2:

(Screenshot: the uploaded files and sub-folders in ADLS Gen2.)
