
How can I use threading in Python to parallelize AWS S3 API calls?

I wrote a Python script that seeks to determine the total size of all available AWS S3 buckets by making use of the AWS Boto 3 list_objects() method.

The logic is simple:

  1. Get the initial list of objects from each S3 bucket (automatically truncated after 1,000 objects)
  2. Iterate through each object in the list of objects, adding the size of that object to a total_size variable
  3. While the bucket still has additional objects, retrieve them and repeat step 2

Here's the relevant code snippet:

import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to first 1,000 objects in bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add size of each individual item in bucket to total size
    for obj in bucket_objects['Contents']:
        total_size += obj['Size']

    # Get additional objects from bucket, if more
    while bucket_objects['IsTruncated']:
        # Get next 1,000 objects, starting after final object of current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size/1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
        % (bucket['Name'], size_in_MB))

This code runs relatively quickly on buckets that have less than 5 MB or so of data in them, but when I hit a bucket that has 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.

My hope was to use the threading module to parallelize the I/O portion of the code (getting the list of objects from S3), so that the total size of all objects in a bucket could be summed as soon as the thread retrieving them completed, rather than having to do the retrieval and addition sequentially.

I understand that Python doesn't support true multithreading because of the GIL (I mention this only to head off responses to that effect), but since this is an I/O-bound operation rather than a CPU-intensive one, my understanding is that the threading module should still be able to improve the run time.

The main difference between my problem and the threading examples I've seen here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and then retrieve the next list of objects based on the final object's key in the current list.

Can anyone explain a way to improve the run time of this code, or is it not possible in this situation?

I ran into similar problems.

It seems to be important to create a separate session for each thread.

So instead of

s3_client = boto3.client('s3')

you need to write

s3_client = boto3.session.Session().client('s3')

otherwise threads interfere with each other, and random errors occur.

Beyond that, the normal issues of multithreading apply.

My project uploads 135,000 files to an S3 bucket. So far I have found that I get the best performance with 8 threads. What would otherwise take 3.6 hours takes 1.25 hours.
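
Applied to the original question (summing object sizes per bucket), the per-thread-session advice might look something like the sketch below. Treat it as a rough outline rather than tested code: bucket_total_size is a hypothetical helper, and the 8-worker count is just the value that happened to work well for my upload job.

import boto3
from concurrent.futures import ThreadPoolExecutor

def bucket_total_size(bucket_name):
    # Give this thread its own session-backed client to avoid the
    # cross-thread interference described above.
    s3 = boto3.session.Session().client('s3')
    total_size = 0
    kwargs = {'Bucket': bucket_name}
    while True:
        response = s3.list_objects(**kwargs)
        for obj in response.get('Contents', []):
            total_size += obj['Size']
        if not response.get('IsTruncated'):
            break
        # Continue listing after the last key returned so far
        kwargs['Marker'] = response['Contents'][-1]['Key']
    return bucket_name, total_size

bucket_names = [b['Name'] for b in boto3.client('s3').list_buckets()['Buckets']]

# One task per bucket; 8 workers is simply what worked well for my uploads
with ThreadPoolExecutor(max_workers=8) as executor:
    for name, size in executor.map(bucket_total_size, bucket_names):
        print('Total size of objects in bucket %s: %.2f MB' % (name, size / 1000000.0))

This only helps when there are multiple buckets to process, since the paginated listing within a single bucket still has to run sequentially.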

I have a solution which may not work in all cases but can cover a good deal of scenarios. If your objects are organised hierarchically in subfolders, then first list only the subfolders, using the mechanism described in this post.

Then, using the obtained set of prefixes, submit them to a multiprocessing pool (or thread pool), where each worker fetches all keys specific to one prefix and collects them in a shared container using a multiprocessing Manager. In this way the keys will be fetched in parallel.

The above solution will perform best if keys are distributed evenly and hierarchically, and worst if the data is organised flat.
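
Below is a rough sketch of that idea, with several assumptions: 'my-bucket' is a placeholder bucket name, the keys are assumed to live under top-level prefixes (Delimiter='/' makes S3 return them as CommonPrefixes, and any objects sitting at the bucket root are ignored), and returning each worker's key list from pool.map stands in for the Manager-backed shared container mentioned above.

import boto3
from multiprocessing import Pool

BUCKET = 'my-bucket'  # placeholder bucket name

def list_keys_for_prefix(prefix):
    # Each worker process builds its own client and pages through one prefix.
    s3 = boto3.session.Session().client('s3')
    keys = []
    kwargs = {'Bucket': BUCKET, 'Prefix': prefix}
    while True:
        response = s3.list_objects_v2(**kwargs)
        keys.extend(obj['Key'] for obj in response.get('Contents', []))
        if not response.get('IsTruncated'):
            break
        kwargs['ContinuationToken'] = response['NextContinuationToken']
    return keys

if __name__ == '__main__':
    s3 = boto3.client('s3')
    # Delimiter='/' returns only the top-level "folders" as CommonPrefixes
    top_level = s3.list_objects_v2(Bucket=BUCKET, Delimiter='/')
    prefixes = [p['Prefix'] for p in top_level.get('CommonPrefixes', [])]

    with Pool(processes=8) as pool:
        # Each worker lists one prefix; the per-prefix results are combined here
        all_keys = [key for keys in pool.map(list_keys_for_prefix, prefixes)
                    for key in keys]
    print('Fetched %d keys in parallel' % len(all_keys))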
