
How to upload small files to Amazon S3 efficiently in Python

Recently, I needed to implement a program in Python to upload files residing on Amazon EC2 to S3 as quickly as possible. The files are about 30 KB each.

I have tried several approaches: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 files * 30 KB per file ≈ 105 MB in total --->

       5.5 s  [ 4 processes + 100 coroutines ]
       10 s   [ 200 coroutines ]
       14 s   [ 10 threads ]

The code is shown below.
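The helpers put() and connect_to_s3_sevice() are small and not shown here; roughly, they look like the following sketch (boto v2, with BUCKET_NAME as a placeholder):

import os

import boto

BUCKET_NAME = 'my-bucket'  # placeholder bucket name


def connect_to_s3_sevice():
    # Return a bucket handle; credentials come from the environment or ~/.boto.
    return boto.connect_s3().get_bucket(BUCKET_NAME)


def put(client, path):
    # Upload one local file, keyed by its basename.
    key = client.new_key(os.path.basename(path))
    key.set_contents_from_filename(path)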

For multithreading

import os
import threading

# NTHREAD and DATA_DIR are module-level constants (not shown here).


def mput(i, client, files):
    # Each thread uploads the files whose hash falls into its bucket.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()

For coroutines

import functools
import os
import sys

import eventlet
eventlet.monkey_patch()  # patch blocking socket I/O so greenlets can actually switch

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size taken from the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()

For multiprocessing + coroutines

import functools
import multiprocessing
import os

import eventlet

# NOTE: the hash-based partition assumes hash() is deterministic across
# processes (true on Python 2; Python 3 randomizes str hashes by default).


def pproc(i):
    # Each process uploads its share of the files with a pool of 100 greenlets.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()

The test machine runs Ubuntu 14.04 with 2 CPUs (2.50 GHz) and 4 GB of memory.

The highest throughput reached is about 19 MB/s (105 MB / 5.5 s). Overall, it is still too slow. Is there any way to speed it up? Could Stackless Python do this faster?

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.

I recently needed to upload about 5 TB of small files to AWS and reached full network bandwidth of ~750 Mbit/s (1 Gbit/s connection per server) without problems by setting a higher "max_concurrent_requests" value in the ~/.aws/config file.

I further sped up the process by starting multiple upload jobs via a bash for-loop and sending these jobs to different servers.

I also tried Python tools such as s3-parallel-put, but I think this approach is much faster. Of course, if the files are very small, you should also consider compressing them, uploading the archive to EBS/S3, and decompressing it there (a sketch of that follows at the end of this answer).

Here is some code that might help.

$ cat ~/.aws/config
[default]
region = eu-west-1
output = text
s3 =
    max_concurrent_requests = 100

Then start multiple aws copy jobs, e.g.:

for folder in `ls`; do aws s3 cp $folder s3://<bucket>/$folder/whatever/ --recursive; done
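And for the compress-then-upload idea mentioned above, a minimal sketch in Python (assuming boto; the bucket name, archive name, and directory argument are placeholders):

import os
import tarfile

import boto


def upload_compressed(data_dir, bucket_name, key_name='small-files.tar.gz'):
    # Bundle the many small files into a single compressed archive...
    archive = os.path.join('/tmp', key_name)
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(data_dir, arcname=os.path.basename(data_dir))
    # ...and issue one large PUT instead of thousands of small ones.
    key = boto.connect_s3().get_bucket(bucket_name).new_key(key_name)
    key.set_contents_from_filename(archive)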

I had the same problem as you. My solution was to send the data to AWS SQS and then save it to S3 using AWS Lambda.

So the data flow looks like: app -> SQS -> Lambda -> S3
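A minimal sketch of the producer side, assuming boto3 (the queue URL is a placeholder):

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.eu-west-1.amazonaws.com/123456789012/upload-queue'  # placeholder


def send(payload):
    # The app only enqueues the payload (SQS bodies are limited to 256 KB,
    # which is plenty for ~30 KB files); S3 persistence happens later in Lambda.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=payload)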

The entire process is asynchronous, but near real-time :)
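And a minimal sketch of the Lambda side, assuming an SQS trigger and boto3 (the bucket name and key prefix are placeholders):

import boto3

s3 = boto3.client('s3')
BUCKET = 'my-bucket'  # placeholder


def handler(event, context):
    # Write each SQS record's body out as one S3 object.
    for record in event['Records']:
        s3.put_object(
            Bucket=BUCKET,
            Key='incoming/{}'.format(record['messageId']),
            Body=record['body'].encode('utf-8'),
        )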
