简体   繁体   中英

Copy multiple azure containers to newly created containers efficiently using Python

I am copying the contents from multiple containers in Azure storage explorer and writing this to a bunch of new containers and want to know the most efficient way to do this.

The existing containers are called cycling-input-1, cycling-input-2,.... and the contents are written to new containers called cycling-output-1, cycling-output-2 etc. The containers are all of the of the same type (jpegs).

The for loop below creates a new container (cycling-output) with the required suffix and then copies the blobs from the relevant cycling-input container into here. I have about 30 containers each with 1000s of images in there, so not sure if this is the best way to do it (it's slow). Is there a better way to do it?

from azure.storage.blob.baseblobservice import BaseBlobService
account_name   = 'name'
account_key    = 'key'

# connect to the storage account
blob_service = BaseBlobService(account_name = account_name, account_key = account_key)

# get a list of the containers that need to be processed
cycling_containers = blob_service.list_containers(prefix = 'cycling-input')

# check the list of containers
for c in cycling_containers:
    print(c.name)


# copy across the blobs from existing containers to new containers with a prefix cycling-output 
prefix_of_new_container = 'cycling-output-'

for c in cycling_containers:
    contname = c.name
    generator = blob_service.list_blobs(contname)
    container_index = ''.join(filter(str.isdigit, contname))
    for blob in generator:
        flag_of_new_container = blob_service.create_container("%s%s" % (prefix_of_new_container, container_index))
        blob_service.copy_blob("%s%s" % (prefix_of_new_container, container_index), blob.name, "https://%s.blob.core.windows.net/%s/%s" % (account_name, contname, blob.name))

The simple way is to use multiprocessing module to parallel copy these blobs of all containers to their new containers named by replacing input with output .

Here is my sample code as reference.

from azure.storage.blob.baseblobservice import BaseBlobService
import multiprocessing

account_name = '<your account name>'
account_key = '<your account key>'

blob_service = BaseBlobService(
    account_name=account_name,
    account_key=account_key
)

cycling_containers = blob_service.list_containers(prefix = 'cycling-input')

def putBlobCopyTriples(queue, num_of_workers):
    for c in cycling_containers:
        container_name = c.name
        new_container_name = container_name.replace('input', 'output')
        blob_service.create_container(new_container_name)
        for blob in blob_service.list_blobs(container_name):
            blob_url = "https://%s.blob.core.windows.net/%s/%s" % (account_name, container_name, blob.name)
            queue.put( (new_container_name, blob.name, blob_url) )
    for i in range(num_of_workers):
        queue.put( (None, None, None) )

def copyWorker(lock, queue, sn):
    while True:
        with lock:
            (new_container_name, blob_name, new_blob_url) = queue.get()
        if new_container_name == None:
            break
        print(sn, new_container_name, blob_name, new_blob_url)
        blob_service.copy_blob(new_container_name, blob_name, new_blob_url)

if __name__ == '__main__':
    num_of_workers = 4 # the number of workers what you want, for example, 4 is my cpu core count
    lock = multiprocessing.Lock()
    queue = multiprocessing.Queue()
    multiprocessing.Process(target = putBlobCopyTriples, args = (queue, num_of_workers)).start()
    workers = [multiprocessing.Process(target = copyWorker, args = (lock, queue, i)) for i in range(num_of_workers)]
    for p in workers:
        p.start()

Note: Except cpu core count on your environment, the copy-speed limits is depended on your IO bandwidth. The worker number is not the more, the better. Recommanded that the number is equals or less than your cpu count or hyper-threading count.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM