I'm trying to delete a lot of files in S3. I plan to use a multiprocessing.Pool
to do all these deletes, but I'm not sure how to keep the S3 client
alive between jobs. I want to do something like
import boto3
import multiprocessing as mp

def work(key):
    s3_client = boto3.client('s3')
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool() as pool:
    pool.map(work, lazy_iterator_of_billion_keys)
The problem with this is that a significant amount of time is spent on s3_client = boto3.client('s3')
at the start of each job. The documentation says to create a new resource instance for each process, so I need a way to make an S3 client per process.
Is there any way to make a persistent s3 client for each process in the pool or cache the clients?
Also, I plan to optimize the deletes by sending batches of keys to s3_client.delete_objects,
but I showed s3_client.delete_object
in my example for simplicity.
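As a side note on the batching idea: delete_objects accepts at most 1,000 keys per call, so the keys have to be grouped first. A minimal chunking sketch (stdlib only; the bucket name and the commented-out delete call are placeholders, not code from the question):

```python
from itertools import islice

def chunked(iterable, size=1000):
    """Yield lists of up to `size` keys; delete_objects caps each request at 1000."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def to_delete_payload(keys):
    """Build the Delete= argument expected by s3_client.delete_objects."""
    return {'Objects': [{'Key': k} for k in keys]}

# Each batch would then be deleted with something like:
# s3_client.delete_objects(Bucket='bucket', Delete=to_delete_payload(batch))
```

This way each job in the pool handles up to 1,000 keys per API call instead of one.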
Check this snippet from the RealPython concurrency tutorial. They create a single requests Session per process: resources can't be shared because each worker process has its own memory space, so they use the pool's initializer to set up a global Session once per process. Otherwise, every call to the worker function would instantiate a new Session, which is an expensive operation.
Following that logic, you can instantiate the boto3 client the same way, and you will only create one client per process.
import requests
import multiprocessing
import time

session = None

def set_global_session():
    global session
    if not session:
        session = requests.Session()

def download_site(url):
    with session.get(url) as response:
        name = multiprocessing.current_process().name
        print(f"{name}:Read {len(response.content)} from {url}")

def download_all_sites(sites):
    with multiprocessing.Pool(initializer=set_global_session) as pool:
        pool.map(download_site, sites)

if __name__ == "__main__":
    sites = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")
I ended up solving this using functools.lru_cache
and a helper function that returns the S3 client. The cache persists within each process, so the client connection is reused. The helper function looks like
import boto3
from functools import lru_cache

@lru_cache()
def s3_client():
    return boto3.client('s3')
and then that is called in my work
function like

def work(key):
    s3 = s3_client()  # use a different local name so the helper isn't shadowed
    s3.delete_object(Bucket='bucket', Key=key)
I was able to test this and benchmark it in the following way:
import os
from time import time

def benchmark(key):
    t1 = time()
    s3 = s3_client()
    resp = s3.head_object(Bucket='bucket', Key=key)
    print(f'[{os.getpid()}] [{resp}] :: Total time: {time() - t1} s')

with mp.Pool() as p:
    p.map(benchmark, big_list_of_keys)
This showed that the first call in each pid took about 0.5 seconds, and subsequent calls in the same pid took about 2e-6 seconds. That was proof enough for me that the client connection was being cached and working as I expected.
Interestingly, without @lru_cache()
on s3_client()
, subsequent calls took about 0.005 seconds, so there must be some internal caching that happens automatically in boto3 that I wasn't aware of.
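The per-process caching behaviour can also be checked without touching AWS at all: lru_cache returns the identical object on every call within a process. A tiny stdlib-only illustration (the Client class here is just a stand-in for an expensive boto3.client('s3') call):

```python
from functools import lru_cache

class Client:
    """Stand-in for an expensive-to-construct client like boto3.client('s3')."""
    pass

@lru_cache()
def get_client():
    return Client()

# Within one process, every call returns the same cached instance,
# so the underlying connection would be preserved between jobs.
assert get_client() is get_client()
```

Each worker process builds its own cache, which is exactly the one-client-per-process behaviour we want.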
And for testing purposes, I benchmarked Milton's answer in the following way:

s3 = None

def set_global_session():
    global s3
    if not s3:
        s3 = boto3.client('s3')

with mp.Pool(initializer=set_global_session) as p:
    p.map(benchmark, big_list_of_keys)
This also averaged about 3e-6 seconds per job, so pretty much the same as using functools.lru_cache
on a helper function.