

How to upload small files to Amazon S3 efficiently in Python

Recently, I needed to implement a program in Python to upload files residing on Amazon EC2 to S3 as quickly as possible. The size of each file is 30 KB.

I have tried several approaches: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 (number of files) * 30 KB (file size) ≈ 105 MB (total) --->

       **5.5 s [ 4 processes + 100 coroutines ]**
       10 s  [ 200 coroutines ]
       14 s  [ 10 threads ]

The code is shown below.

For multithreading:

import os
import threading

# put() and connect_to_s3_sevice() are the uploading helpers defined elsewhere;
# NTHREAD and DATA_DIR are module-level constants.

def mput(i, client, files):
    # Each thread uploads only the files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()
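
The put() and connect_to_s3_sevice() helpers are not shown in the question; below is a minimal sketch of what they might look like, assuming boto3 rather than the original boto SDK, with a hypothetical bucket name:

import os
import boto3  # assumption: boto3; the original code may have used boto 2

BUCKET = "my-bucket"  # hypothetical bucket name

def connect_to_s3_sevice():
    # Return an S3 client; each process or thread can create its own.
    return boto3.client("s3")

def put(client, path):
    # Upload one local file, keyed by its base name (assumed key layout).
    client.upload_file(path, BUCKET, os.path.basename(path))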

For coroutines:

import functools, os, sys
import eventlet
eventlet.monkey_patch()  # patch blocking I/O so green threads can yield (assumed not already done elsewhere)

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size taken from the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()

For multiprocessing + coroutines:

import functools, multiprocessing, os
import eventlet

def pproc(i):
    # Each process opens its own S3 connection and its own green pool,
    # and takes only the files whose hash maps to its index.
    # NOTE: hash() must give the same values in every process for this
    # partitioning to work (true on Python 2; Python 3 randomizes string
    # hashes unless PYTHONHASHSEED is fixed).
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()

The machine configuration is Ubuntu 14.04, 2 CPUs (2.50 GHz), 4 GB of memory.

The highest speed reached is about 19 MB/s (105 / 5.5). Overall, it is too slow. Is there any way to speed it up? Could Stackless Python do it faster?

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
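
A minimal sketch of shelling out to the CLI from Python; the local directory and bucket/prefix names are assumptions. `aws s3 sync` (or `aws s3 cp --recursive`) uploads many files and parallelizes the transfers internally according to the configured max_concurrent_requests:

import subprocess

# Hypothetical names; DATA_DIR holds the local files to upload.
DATA_DIR = "/data/small_files"
DEST = "s3://my-bucket/uploads/"

# Let the CLI handle parallelism and retries for the whole directory.
subprocess.check_call(["aws", "s3", "sync", DATA_DIR, DEST])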

I recently needed to upload about 5 TB of small files to AWS and reached full network bandwidth, ~750 Mbit/s (1 Gbit connection per server), without problems by setting a higher "max_concurrent_requests" value in the ~/.aws/config file.

I further sped up the process by starting multiple upload jobs via a bash for-loop and sending these jobs to different servers.

I also tried Python tools, e.g. s3-parallel-put, but I think this approach is much faster. Of course, if the files are very small, you should also consider compressing them, uploading the archive to EBS/S3, and decompressing there (a sketch of that idea follows the copy loop below).

Here is some code that might help.

$ cat ~/.aws/config
[default]
region = eu-west-1
output = text
s3 =
    max_concurrent_requests = 100

Then start multiple aws copy jobs, e.g.:

for folder in `ls`; do aws s3 cp $folder s3://<bucket>/$folder/whatever/ --recursive; done
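
A rough Python sketch of the compress-then-upload idea mentioned above; boto3, the bucket name, and the paths are assumptions:

import tarfile
import boto3

# Hypothetical paths and bucket name.
SRC_DIR = "/data/small_files"
ARCHIVE = "/tmp/small_files.tar.gz"

# Pack the many small files into one compressed archive...
with tarfile.open(ARCHIVE, "w:gz") as tar:
    tar.add(SRC_DIR, arcname="small_files")

# ...then upload a single large object and decompress on the other side.
boto3.client("s3").upload_file(ARCHIVE, "my-bucket", "small_files.tar.gz")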

I had the same problem as you. My solution was to send the data to AWS SQS and then save it to S3 using AWS Lambda.

So the data flow looks like: app -> SQS -> Lambda -> S3

The entire process is asynchronous, but near real-time :)
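
A minimal sketch of the producer side of that flow, assuming boto3 and a hypothetical queue URL; a Lambda function triggered by the queue would then write each message to S3 with put_object():

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/upload-queue"  # hypothetical

def enqueue(file_name, payload):
    # The app only enqueues the data; the Lambda consumer does the S3 write.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"key": file_name, "data": payload}),
    )

Note that SQS messages are limited to 256 KB, so this pattern only works because the files here are small (30 KB each).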
