Multithreading / multiprocessing with a Python loop
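I have a script that loops through a range of URLs to pull item locations from the returned JSON data. However, the script takes 60 minutes to run, and 55 of those minutes (according to cProfile) are spent waiting for the JSON data to load.
I would like to use multithreading to run multiple POST requests at a time to speed this up, and I initially split the URL range into two halves to do so. Where I am stuck is how to implement multithreading or asyncio.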
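Since almost all of the run time is spent waiting on the network, a thread pool is one way to overlap those waits. The following is a minimal sketch, not the original script: `fetch_location` is a hypothetical stand-in that sleeps instead of sending a real POST request, and `fetch_all` and `max_workers=20` are illustrative choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_location(store):
    """Hypothetical stand-in for one blocking POST request."""
    time.sleep(0.05)  # simulates waiting on the network
    return store, "aisle-{}".format(store % 10)


def fetch_all(stores, max_workers=20):
    """Run fetch_location for every store on a pool of worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_location, store) for store in stores]
        # as_completed yields each future as soon as its request finishes
        for future in as_completed(futures):
            store, aisle = future.result()
            results[store] = aisle
    return results


if __name__ == "__main__":
    start = time.perf_counter()
    locations = fetch_all(range(40))
    print(len(locations), "stores in", round(time.perf_counter() - start, 2), "s")
```

Replacing the body of `fetch_location` with a real blocking request would keep the rest unchanged; the pool size then caps how many requests are in flight at once.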
Slimmed-down code:
import asyncio
import aiohttp

# Globals are not recommended; one is used here only to keep the example short
results = dict()
url = "https://www.website.com/store/ajax/search"
query = "store={}&size=18&query=17360031"


# Basic URL opener, adapted from the aiohttp documentation
async def open_url(store):
    async with aiohttp.ClientSession() as session:
        async with session.post(url, data={'searchQuery': query.format(store)}) as resp:
            return await resp.json(), store


async def processing():
    # No 'global' statement is needed: the dict is mutated, not rebound.
    # One of the simplest ways to run the requests concurrently is to start
    # them all and store each result in the dict as soon as it is ready.
    tasks = [open_url(store) for store in range(0, 5)]
    for coro in asyncio.as_completed(tasks):
        try:
            data, store = await coro
            results[store] = data['searchResults']['results'][0]['location']['aisle']
        except (IndexError, KeyError):
            continue


if __name__ == '__main__':
    event_loop = asyncio.new_event_loop()
    try:
        event_loop.run_until_complete(processing())
    finally:
        event_loop.close()
    # Print results
    for store, data in results.items():
        print(store, data)
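The `asyncio.as_completed` pattern used above can be tried without a server. This is a small sketch under stated assumptions: `fake_request` is a hypothetical replacement for `open_url` that only sleeps, `collect` is an illustrative name, and `asyncio.run` requires Python 3.7 or later.

```python
import asyncio


async def fake_request(store, delay):
    """Hypothetical stand-in for open_url: waits, then returns (data, store)."""
    await asyncio.sleep(delay)
    return {"aisle": "A{}".format(store)}, store


async def collect(delays):
    """Start every request at once and record stores in completion order."""
    order = []
    tasks = [asyncio.ensure_future(fake_request(s, d)) for s, d in delays.items()]
    for finished in asyncio.as_completed(tasks):
        data, store = await finished
        order.append(store)
    return order


if __name__ == "__main__":
    # Store 1 has the shortest delay, so it should be reported first.
    print(asyncio.run(collect({0: 0.3, 1: 0.1, 2: 0.2})))
```

Results arrive in the order the requests finish, not the order they were started, which is why each coroutine returns the store number alongside its data.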
JSON:
{u'count': 1,
 u'results': [{u'department': {u'name': u'Home', u'storeDeptId': -1},
               u'location': {u'aisle': [A], u'detailed': [A.536]},
               u'score': u'0.507073'}],
 u'totalCount': 1}
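Pulling the aisle out of a response shaped like this can be isolated in a small helper, so a missing key or an empty result list does not abort the whole run. This is only a sketch: `extract_aisle` is a hypothetical name, the `searchResults` wrapper is assumed from the lookup in the code, and the sample values are illustrative.

```python
def extract_aisle(payload):
    """Return the first result's aisle list, or None if the shape differs."""
    try:
        return payload["searchResults"]["results"][0]["location"]["aisle"]
    except (IndexError, KeyError, TypeError):
        # IndexError: empty results; KeyError: missing field;
        # TypeError: payload (or a nested value) is None or not a mapping
        return None


# Illustrative sample; the values are assumptions, not real responses.
sample = {
    "searchResults": {
        "count": 1,
        "results": [{"location": {"aisle": ["A"], "detailed": ["A.536"]}}],
        "totalCount": 1,
    }
}

if __name__ == "__main__":
    print(extract_aisle(sample))
    print(extract_aisle({"searchResults": {"results": []}}))
```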
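If you want to run the requests concurrently (I hope that is what you are asking for), this code snippet will help. It contains a request opener and sends 2000 POST requests via aiohttp and asyncio. Python 3.5 was used.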
import asyncio
import aiohttp

# Globals are not recommended; one is used here only to keep the example short
results = dict()
MAX_RETRIES = 5
MATCH_SLEEP_TIME = 3  # consider moving constants like these to a separate module, e.g. constants.py
MAX_CONCURRENT = 50  # number of concurrent requests
url = "https://www.website.com/store/ajax/search"
query = "store={}&size=18&query=44159"


# Basic URL opener, adapted from the aiohttp documentation
async def open_url(store, semaphore):
    for _ in range(MAX_RETRIES):
        async with semaphore:
            try:
                async with aiohttp.ClientSession() as session:
                    async with session.post(url, data={'searchQuery': query.format(store)}) as resp:
                        return await resp.json(), store
            except ConnectionResetError:
                # More exception types can be handled here, sleeping before each retry
                await asyncio.sleep(MATCH_SLEEP_TIME)
                continue
    return None


async def processing():
    # Create the semaphore inside the coroutine so it belongs to the running loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    # One of the simplest ways to run the requests concurrently is to start
    # them all and store each result in the dict as soon as it is ready.
    tasks = [open_url(store, semaphore) for store in range(0, 2000)]
    for coro in asyncio.as_completed(tasks):
        try:
            response = await coro
            if response is None:
                continue
            data, store = response
            results[store] = data['searchResults']['results'][0]['location']['aisle']
        except (IndexError, KeyError):
            continue


if __name__ == '__main__':
    event_loop = asyncio.new_event_loop()
    try:
        event_loop.run_until_complete(processing())
    finally:
        event_loop.close()
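The combination of a semaphore and retries can be checked offline with a simulated endpoint. The sketch below is an assumption-laden illustration, not the answer's code: `FlakyServer`, `fetch`, and `run` are hypothetical names, the server deliberately fails each store once, and `asyncio.run` requires Python 3.7 or later. One variation it shows is releasing the semaphore slot before sleeping, so a waiting retry does not block other requests.

```python
import asyncio


class FlakyServer:
    """Hypothetical stand-in for the remote endpoint; fails each store once."""

    def __init__(self):
        self.seen = set()
        self.active = 0  # requests currently in flight
        self.peak = 0    # highest number of simultaneous requests observed

    async def post(self, store):
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(0.01)  # simulated network latency
            if store not in self.seen:
                self.seen.add(store)
                raise ConnectionResetError("simulated reset")
            return {"store": store}
        finally:
            self.active -= 1


async def fetch(server, store, semaphore, retries=3, pause=0.01):
    """Try one store up to `retries` times, holding a slot only while posting."""
    for _ in range(retries):
        try:
            async with semaphore:
                return await server.post(store)
        except ConnectionResetError:
            await asyncio.sleep(pause)  # the slot is already released here
    return None


async def run(n_stores=30, limit=5):
    server = FlakyServer()
    semaphore = asyncio.Semaphore(limit)  # created inside the running loop
    results = await asyncio.gather(
        *(fetch(server, s, semaphore) for s in range(n_stores))
    )
    return results, server.peak


if __name__ == "__main__":
    results, peak = asyncio.run(run())
    print(len(results), "responses, peak concurrency", peak)
```

The recorded peak gives a direct check that the limit is respected, which is harder to observe against a real server.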