Simultaneous execution of different functions with multiprocessing in Python
I have a class that continuously updates data and analyzes it, every 30 seconds. The point is that, since the functions must stay separate (as you can see in the pseudocode), I am forced to download all the data first and only then analyze it. My goal is to rewrite the code so that a piece of data is analyzed as soon as it becomes available. That workflow is illustrated in pseudocode 02, but is it possible to achieve it without merging the two functions? Thanks for your advice :)
Pseudocode 01:
import datetime, multiprocessing

class myclass:
    def __init__(self, list_of_symbols):
        self.last_update = 0
        self.list_of_symbols = list_of_symbols

    def get_data(self, symbol):
        request_data_to_api(symbol)

    def analyze_data(self, symbol):
        analyze(symbol)

    def run():
        while True:
            if (self.last_update == 0) or ((datetime.datetime.now() - self.last_update).seconds >= 30):
                self.last_update = datetime.datetime.now()
                # Update data as soon as possible
                pool = multiprocessing.Pool(20)
                pool.map(self.get_data, self.list_of_symbols)
                pool.close()
                pool.join()
                # Analyze data as soon as possible
                pool = multiprocessing.Pool(20)
                pool.map(self.analyze_data, self.list_of_symbols)
                pool.close()
                pool.join()
Pseudocode 02:
class myclass:
    def __init__(self, list_of_symbols):
        self.last_update = 0
        self.list_of_symbols = list_of_symbols

    def get_all_in_one(self, symbol):
        request_data_to_api(symbol)
        analyze(symbol)

    def run():
        while True:
            if (self.last_update == 0) or ((datetime.datetime.now() - self.last_update).seconds >= 30):
                self.last_update = datetime.datetime.now()
                pool = multiprocessing.Pool(20)
                pool.map(self.get_all_in_one, self.list_of_symbols)
                pool.close()
                pool.join()
I am assuming that the method get_data is an I/O-bound task, for which multithreading is better suited (and which can benefit from a larger thread pool), while analyze_data is CPU-intensive, should use multiprocessing, and should be limited to the number of CPU cores you have. I have also modified your methods' signatures and what they return, which I believe is necessary.

The idea is (1) to create the pools once, outside the loop, since creating a pool can be expensive, and (2) to use imap_unordered (with a suitable chunksize argument for efficiency) so that results returned from get_data can be submitted to analyze_data as soon as they become available:
import datetime
from multiprocessing import Pool, cpu_count
from multiprocessing.pool import ThreadPool

class myclass:
    def __init__(self, list_of_symbols):
        self.last_update = 0
        self.list_of_symbols = list_of_symbols

    def get_data(self, symbol):
        data = request_data_to_api(symbol)
        return symbol, data

    def analyze_data(self, symbol, data):
        return analyze(symbol, data)

    def run(self):  # added missing self argument
        def compute_chunksize(iterable_size, pool_size):
            chunksize, remainder = divmod(iterable_size, 4 * pool_size)
            if remainder:
                chunksize += 1
            return chunksize

        iterable_size = len(self.list_of_symbols)
        OPTIMAL_THREADPOOL_SIZE = 50  # Your guess is as good as mine
        threadpool_size = min(iterable_size, OPTIMAL_THREADPOOL_SIZE)
        thread_pool = ThreadPool(threadpool_size)
        processpool_size = min(iterable_size, cpu_count())
        process_pool = Pool(processpool_size)  # use all the CPU cores that are available
        chunksize = compute_chunksize(iterable_size, processpool_size)
        while True:
            if (self.last_update == 0) or ((datetime.datetime.now() - self.last_update).seconds >= 30):
                self.last_update = datetime.datetime.now()
                # Update data as soon as possible
                results = thread_pool.imap_unordered(self.get_data, self.list_of_symbols, chunksize)
                # as results become available:
                async_results = []
                for symbol, data in results:
                    async_results.append(process_pool.apply_async(self.analyze_data, args=(symbol, data)))
                for async_result in async_results:
                    result = async_result.get()  # return value from analyze_data
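The compute_chunksize helper above mirrors the heuristic Pool.map uses internally for its default chunksize: aim for roughly four chunks per worker, rounding up when there is a remainder. A standalone check of its behavior:

```python
# Standalone copy of the compute_chunksize helper from the code above:
# split the iterable into roughly four chunks per worker, rounding up.
def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

print(compute_chunksize(100, 8))  # 100 symbols, 8 workers -> chunks of 4
print(compute_chunksize(32, 8))   # exact multiple of 4 * 8 -> chunks of 1
```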
Note

If get_data is also CPU-intensive, just create the multiprocessing pool alone and replace the thread_pool.imap_unordered(... call with process_pool.imap_unordered(...
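To see the fetch-then-analyze pipeline in isolation, here is a minimal runnable sketch. The fetch and analyze functions are hypothetical stubs standing in for the real API call and analysis, and both pools are thread pools so the snippet runs anywhere without the __main__ guard a real multiprocessing.Pool would require; in the actual code the analysis pool would be a process pool as shown above.

```python
from multiprocessing.pool import ThreadPool

def fetch(symbol):
    # hypothetical stand-in for request_data_to_api(symbol)
    return symbol, {"price": len(symbol)}

def analyze(symbol, data):
    # hypothetical stand-in for the real CPU-bound analysis
    return symbol, data["price"] * 2

symbols = ["AAPL", "GOOG", "IBM"]
fetch_pool = ThreadPool(3)    # I/O-bound stage
analyze_pool = ThreadPool(2)  # would be multiprocessing.Pool in real code

# Each symbol is submitted to the analysis pool as soon as its fetch
# completes, instead of waiting for all fetches to finish first.
async_results = []
for symbol, data in fetch_pool.imap_unordered(fetch, symbols):
    async_results.append(analyze_pool.apply_async(analyze, (symbol, data)))

results = dict(r.get() for r in async_results)
print(results)  # {'AAPL': 8, 'GOOG': 8, 'IBM': 6} (insertion order may vary)
fetch_pool.close()
analyze_pool.close()
```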