
Bulk upload to Azure Data Lake Gen 2 with REST APIs

In another related question I asked how to upload files from on-premises to Microsoft Azure Data Lake Gen 2, and an answer was provided using the REST APIs. For the sake of completeness, the proposed code can be found below.

Since uploading files sequentially in this way has proven to be relatively slow for large numbers of small files (about 0.05 MB each), I would like to ask whether it is possible to perform a bulk upload of all of them at once, assuming all file paths are known beforehand?

The code for uploading single files to ADLS Gen 2 using the REST APIs:

import requests
import json

def auth(tenant_id, client_id, client_secret):
    # Acquire an OAuth 2.0 access token for Azure Storage via the client-credentials flow.
    print('auth')
    auth_headers = {
        "Content-Type": "application/x-www-form-urlencoded"
    }
    auth_body = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope" : "https://storage.azure.com/.default",
        "grant_type" : "client_credentials"
    }
    resp = requests.post(f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token", headers=auth_headers, data=auth_body)
    return (resp.status_code, json.loads(resp.text))

def mkfs(account_name, fs_name, access_token):
    # Create a filesystem (container) in the ADLS Gen2 account.
    print('mkfs')
    fs_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}?resource=filesystem", headers=fs_headers)
    return (resp.status_code, resp.text)

def mkdir(account_name, fs_name, dir_name, access_token):
    # Create a directory inside the filesystem.
    print('mkdir')
    dir_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}?resource=directory", headers=dir_headers)
    return (resp.status_code, resp.text)

def touch_file(account_name, fs_name, dir_name, file_name, access_token):
    # Create an empty file at the given path (resource=file).
    print('touch_file')
    touch_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}/{file_name}?resource=file", headers=touch_file_headers)
    return (resp.status_code, resp.text)

def append_file(account_name, fs_name, path, content, position, access_token):
    # Append content to the file at the given byte position (action=append).
    print('append_file')
    append_file_headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "text/plain",
        "Content-Length": f"{len(content)}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=append&position={position}", headers=append_file_headers, data=content)
    return (resp.status_code, resp.text)

def flush_file(account_name, fs_name, path, position, access_token):
    # Flush (commit) the appended data up to the given position (action=flush).
    print('flush_file')
    flush_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=flush&position={position}", headers=flush_file_headers)
    return (resp.status_code, resp.text)

def mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token):
    # Upload a local file: create the remote file, append its full content, then flush.
    print('mkfile')
    status_code, result = touch_file(account_name, fs_name, dir_name, file_name, access_token)
    if status_code == 201:
        with open(local_file_name, 'rb') as local_file:
            path = f"{dir_name}/{file_name}"
            content = local_file.read()
            position = 0
            append_file(account_name, fs_name, path, content, position, access_token)
            position = len(content)
            flush_file(account_name, fs_name, path, position, access_token)
    else:
        print(result)


if __name__ == '__main__':
    tenant_id = '<your tenant id>'
    client_id = '<your client id>'
    client_secret = '<your client secret>'

    account_name = '<your adls account name>'
    fs_name = '<your filesystem name>'
    dir_name = '<your directory name>'
    file_name = '<your file name>'
    local_file_name = '<your local file name>'

    # Acquire an Access token
    auth_status_code, auth_result = auth(tenant_id, client_id, client_secret)
    access_token = auth_result['access_token'] if auth_status_code == 200 else ''
    print(access_token)

    # Create a filesystem
    mkfs_status_code, mkfs_result = mkfs(account_name, fs_name, access_token)
    print(mkfs_status_code, mkfs_result)

    # Create a directory
    mkdir_status_code, mkdir_result = mkdir(account_name, fs_name, dir_name, access_token)
    print(mkdir_status_code, mkdir_result)

    # Create a file from local file
    mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)
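For reference, a minimal sketch of how the helpers above could be driven concurrently to reduce the per-file latency; file_pairs is a hypothetical list of (local path, remote file name) tuples, and the account, filesystem, directory and access_token variables from the __main__ block are assumed to already be defined:

from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical input: pairs of (local file path, remote file name).
file_pairs = [("F:\\temp\\a.csv", "a.csv"), ("F:\\temp\\b.csv", "b.csv")]

def upload_all(file_pairs, max_workers=8):
    # Submit one mkfile() call per file and run them on a thread pool.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(mkfile, account_name, fs_name, dir_name, remote_name,
                        local_name, access_token): remote_name
            for local_name, remote_name in file_pairs
        }
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker
            print(f"uploaded {futures[future]}")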

As of now, the fastest way to upload a large number of files to ADLS Gen2 is to use AzCopy. You can write Python code to call AzCopy.

First, download AzCopy.exe as per this link. After downloading, unzip the file and copy azcopy.exe to a folder (no installation needed, it's an executable file), for example F:\azcopy\v10\azcopy.exe.

Then generate a SAS token from the Azure portal, and copy and save the SAS token:

(Screenshot: generating a SAS token in the Azure portal.)

This assumes you have already created a filesystem for your ADLS Gen2 account; you do not need to create the directory manually, as AzCopy will create it automatically.

Another thing to note: for the endpoint, you should change dfs to blob, e.g. change https://youraccount.dfs.core.windows.net/ to https://youraccount.blob.core.windows.net/.
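For instance, a minimal sketch of that substitution (youraccount is just a placeholder):

# Derive the blob endpoint from the ADLS Gen2 (dfs) endpoint; "youraccount" is a placeholder.
dfs_endpoint = "https://youraccount.dfs.core.windows.net/"
blob_endpoint = dfs_endpoint.replace(".dfs.core.windows.net", ".blob.core.windows.net")
print(blob_endpoint)  # https://youraccount.blob.core.windows.net/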

The sample code is as follows:

import subprocess

exepath = "F:\\azcopy\\v10\\azcopy.exe"
local_directory="F:\\temp\\1\\*"
sasToken="?sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-20T09:44:22Z&st=2019-09-20T01:44:22Zxxxxxxxx"

#note for the endpoint, you should change dfs to blob
endpoint="https://yygen2.blob.core.windows.net/w22/testfile5/"
myscript=exepath + " copy " + "\""+ local_directory + "\" " + "\""+endpoint+sasToken + "\"" + " --recursive"

print(myscript)
subprocess.call(myscript)

print("completed")

The test result is as below; all the files and sub-folders in the local directory are uploaded to ADLS Gen2:

(Screenshot: the uploaded files and sub-folders in ADLS Gen2.)
