
Uploading multiple files in parallel to S3 using boto

http://ls.pwd.io/2013/06/parallel-s3-uploads-using-boto-and-threads-in-python/

I tried the second solution mentioned in the link to upload multiple files to S3. The code in that link doesn't call "join" on the threads, which means the main program can terminate even though the threads are still running. With this approach the overall program executes much faster, but there is no guarantee that the files were uploaded correctly. Is that really true? I am mostly concerned with the main program finishing fast. What side effects can this approach have?
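
For reference, joining the upload threads keeps most of the speedup while guaranteeing every upload has finished before the program exits. A minimal sketch, assuming the boto 2 API used in the linked article (connect_s3, get_bucket, new_key, set_contents_from_filename) and credentials already configured; the bucket name and file list are placeholders:

import os
from threading import Thread

import boto


def upload(bucket_name, filepath):
    # one connection per thread; sharing a single boto connection
    # across threads is not safe
    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    key = bucket.new_key(os.path.basename(filepath))
    key.set_contents_from_filename(filepath)


files = ['a.txt', 'b.txt', 'c.txt']  # placeholder file list
threads = [Thread(target=upload, args=('my-bucket', f)) for f in files]
for t in threads:
    t.start()
for t in threads:
    t.join()  # don't continue until every upload has completed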

Just having a little play with this, I see that multiprocessing takes a while to tear down a Pool, but otherwise there's not much in it.

The test code is:

from time import time, sleep
from multiprocessing.pool import Pool, ThreadPool
from threading import Thread


N_WORKER_JOBS = 10


def worker(x):
    # simulate ~100ms of work per job
    # print("working on", x)
    sleep(0.1)


def mp_proc(fn, n):
    # time creation, map, and teardown of a process Pool separately
    start = time()
    with Pool(N_WORKER_JOBS) as pool:
        t1 = time() - start
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start
    print(f'Pool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def mp_threads(fn, n):
    # same measurement, but with a ThreadPool
    start = time()
    with ThreadPool(N_WORKER_JOBS) as pool:
        t1 = time() - start
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start
    print(f'ThreadPool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def threads(fn, n):
    # plain threads, started then explicitly joined
    threads = []
    for i in range(n):
        t = Thread(target=fn, args=(i,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()


# run each approach 7 times and report per-run wall time in ms
for test in [mp_proc, mp_threads, threads]:
    times = []
    for _ in range(7):
        start = time()
        test(worker, 10)
        times.append(time() - start)

    times = ', '.join(f'{t*1000:.2f}' for t in times)
    print(f'{test.__name__} took {times}ms')

I get the following timings (Python 3.7.3, Linux 5.0.8):

  • mp_proc ~220ms
  • mp_threads ~200ms
  • threads ~100ms

However, the teardown times are all ~100ms, which brings everything mostly into line.

I've poked around with logging and in the source, and it seems to be due to _handle_workers only checking every 100ms (it does status checks, then sleeps for 0.1 seconds).
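
A quick way to see this for yourself: dump the source of the handler loop. On the CPython 3.7 used above, it ends each iteration with a fixed time.sleep(0.1):

import inspect
from multiprocessing import pool

# print the worker-handler loop; on CPython 3.7 it sleeps 0.1s per check
print(inspect.getsource(pool.Pool._handle_workers))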

With this knowledge, I can change the code to sleep for 0.095 seconds, and then everything is within 10% of everything else. Also, given that this happens just once at pool teardown, it's easy to arrange for it not to happen in an inner loop.
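
For example, creating the pool once and reusing it across batches pays that teardown cost a single time. A minimal sketch (the worker and batch contents here are placeholders):

from time import sleep
from multiprocessing.pool import ThreadPool


def worker(x):
    sleep(0.1)  # stand-in for real work, e.g. an upload


# create the pool once, outside the loop: teardown (and its ~100ms
# handler sleep) then happens only once, not per batch
with ThreadPool(10) as pool:
    for batch in (range(10), range(10), range(10)):  # placeholder batches
        pool.map(worker, batch)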
