
Bulk upload to Azure Data Lake Gen 2 with REST APIs

In another related question I asked how to upload files from local storage to Microsoft Azure Data Lake Gen 2 via the REST API, and an answer with working code was provided there. For completeness, the suggested code can be found below.

Since this sequential upload turned out to be relatively slow for a large number of fairly small files (about 0.05 MB each), I would like to ask whether there is a way to bulk upload all files at once, assuming that all file paths are known in advance?

Code to upload a single file to ADLS Gen 2 with the REST API:

import requests
import json

def auth(tenant_id, client_id, client_secret):
    print('auth')
    auth_headers = {
        "Content-Type": "application/x-www-form-urlencoded"
    }
    auth_body = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope" : "https://storage.azure.com/.default",
        "grant_type" : "client_credentials"
    }
    resp = requests.post(f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token", headers=auth_headers, data=auth_body)
    return (resp.status_code, json.loads(resp.text))

def mkfs(account_name, fs_name, access_token):
    print('mkfs')
    fs_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}?resource=filesystem", headers=fs_headers)
    return (resp.status_code, resp.text)

def mkdir(account_name, fs_name, dir_name, access_token):
    print('mkdir')
    dir_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}?resource=directory", headers=dir_headers)
    return (resp.status_code, resp.text)

def touch_file(account_name, fs_name, dir_name, file_name, access_token):
    print('touch_file')
    touch_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}/{file_name}?resource=file", headers=touch_file_headers)
    return (resp.status_code, resp.text)

def append_file(account_name, fs_name, path, content, position, access_token):
    print('append_file')
    append_file_headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "text/plain",
        "Content-Length": f"{len(content)}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=append&position={position}", headers=append_file_headers, data=content)
    return (resp.status_code, resp.text)

def flush_file(account_name, fs_name, path, position, access_token):
    print('flush_file')
    flush_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=flush&position={position}", headers=flush_file_headers)
    return (resp.status_code, resp.text)

def mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token):
    print('mkfile')
    status_code, result = touch_file(account_name, fs_name, dir_name, file_name, access_token)
    if status_code == 201:
        with open(local_file_name, 'rb') as local_file:
            path = f"{dir_name}/{file_name}"
            content = local_file.read()
            position = 0
            append_file(account_name, fs_name, path, content, position, access_token)
            position = len(content)
            flush_file(account_name, fs_name, path, position, access_token)
    else:
        print(result)


if __name__ == '__main__':
    tenant_id = '<your tenant id>'
    client_id = '<your client id>'
    client_secret = '<your client secret>'

    account_name = '<your adls account name>'
    fs_name = '<your filesystem name>'
    dir_name = '<your directory name>'
    file_name = '<your file name>'
    local_file_name = '<your local file name>'

    # Acquire an Access token
    auth_status_code, auth_result = auth(tenant_id, client_id, client_secret)
    access_token = auth_result['access_token'] if auth_status_code == 200 else ''
    print(access_token)

    # Create a filesystem
    mkfs_status_code, mkfs_result = mkfs(account_name, fs_name, access_token)
    print(mkfs_status_code, mkfs_result)

    # Create a directory
    mkdir_status_code, mkdir_result = mkdir(account_name, fs_name, dir_name, access_token)
    print(mkdir_status_code, mkdir_result)

    # Create a file from local file
    mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)
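
One client-side option on the REST route is to issue the per-file calls above concurrently. A minimal sketch, assuming the mkfile() helper and the account_name, fs_name and access_token values defined above, plus a hypothetical files list of (dir_name, file_name, local_file_name) tuples:

from concurrent.futures import ThreadPoolExecutor

# hypothetical list of files whose paths are known in advance
files = [
    ('dir1', 'a.csv', 'local/a.csv'),
    ('dir1', 'b.csv', 'local/b.csv'),
]

def upload_one(entry):
    # reuse mkfile() from above for a single (dir_name, file_name, local_file_name) tuple
    dir_name, file_name, local_file_name = entry
    mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns lazily; wrapping it in list() waits for every upload to finish
    list(pool.map(upload_one, files))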

By far the fastest way to upload a large number of files to ADLS Gen 2 is to use AzCopy. You can write Python code that calls AzCopy.

First, follow this link to download AzCopy. After downloading, unzip it and copy azcopy.exe to a folder (no installation is needed, it is a single executable), for example F:\azcopy\v10\azcopy.exe.

Then generate a SAS token from the Azure portal, and copy and save it:

[screenshot: generating a SAS token in the Azure portal]

This assumes you have already created a file system in the ADLS Gen 2 account; the directories do not need to be created manually, AzCopy creates them automatically.

Another thing to note is the endpoint: you should change dfs to blob, i.e. change https://youraccount.dfs.core.windows.net/ to https://youraccount.blob.core.windows.net/.
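
For example, a small sketch of deriving the blob endpoint from a dfs endpoint (the account name is a placeholder):

dfs_endpoint = "https://youraccount.dfs.core.windows.net/"
# swap the dfs host for the blob host of the same storage account
blob_endpoint = dfs_endpoint.replace(".dfs.core.windows.net", ".blob.core.windows.net")
print(blob_endpoint)  # https://youraccount.blob.core.windows.net/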

The sample code is as follows:

import subprocess

exepath = "F:\\azcopy\\v10\\azcopy.exe"
local_directory = "F:\\temp\\1\\*"
sasToken = "?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-20T09:44:22Z&st=2019-09-20T01:44:22Zxxxxxxxx"

# note: for the endpoint, you should change dfs to blob
endpoint = "https://yygen2.blob.core.windows.net/w22/testfile5/"

# pass the command as a list so azcopy receives each argument intact,
# with no manual quoting of the wildcard path or the SAS URL
myscript = [exepath, "copy", local_directory, endpoint + sasToken, "--recursive"]

print(" ".join(myscript))
subprocess.call(myscript)

print("completed")

The test result is as follows; all files and subfolders in the local directory were uploaded to ADLS Gen 2:

[screenshot: the local files and subfolders shown in the ADLS Gen 2 file system]
