How to convert lots of HTML to JSON with asyncio in Python
This is my first attempt at using asyncio in Python. The objective is to convert 40,000+ HTML files to JSON. Using a synchronous for loop takes about 3.5 minutes, and I am interested to see the performance boost from using asyncio. I am using the following code:
import glob
import json
from parsel import Selector
import asyncio
import aiofiles

async def read_html(path):
    async with aiofiles.open(path, 'r') as f:
        html = await f.read()
    return html

async def parse_job(path):
    html = await read_html(path)
    sel_obj = Selector(html)
    jobs = dict()
    jobs['some_var'] = sel_obj.xpath('some-xpath').get()
    return jobs

async def write_json(path):
    job = await parse_job(path)
    async with aiofiles.open(path.replace("html", "json"), "w") as f:
        await f.write(json.dumps(job))

async def bulk_read_and_write(files):
    # this function is from a Real Python tutorial.
    # I have little understanding of what's going on with gather()
    tasks = list()
    for file in files:
        tasks.append(write_json(file))
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    files = glob.glob("some_folder_path/*.html")
    asyncio.run(bulk_read_and_write(files))
After a few seconds of running, I get the following error:
Traceback (most recent call last):
File "06_extract_jobs_async.py", line 84, in <module>
asyncio.run(bulk_read_and_write(files))
File "/anaconda3/envs/py37/lib/python3.7/asyncio/runners.py", line 43, in run
return loop.run_until_complete(main)
File "/anaconda3/envs/py37/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
return future.result()
File "06_extract_jobs_async.py", line 78, in bulk_read_and_write
await asyncio.gather(*tasks)
File "06_extract_jobs_async.py", line 68, in write_json
job = await parse_job(path)
File "06_extract_jobs_async.py", line 35, in parse_job
html = await read_html(path)
File "06_extract_jobs_async.py", line 29, in read_html
async with aiofiles.open(path, 'r') as f:
File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/base.py", line 78, in __aenter__
self._obj = yield from self._coro
File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 35, in _open
f = yield from loop.run_in_executor(executor, cb)
File "/anaconda3/envs/py37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
OSError: [Errno 24] Too many open files: '../html_output/jobs/6706538_478752_job.html'
What is going on here? Thanks in advance.
You're making async calls as fast as you can, but writing a file to disk is still an effectively synchronous task. Your OS can try to perform multiple writes at once, but there is a limit. By spawning async tasks as quickly as possible, you're getting a lot of results at once, which means a huge number of files open for writing at the same time. As your error suggests, there is a limit on how many file descriptors a process can hold open.
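As a side note, on Unix-like systems you can inspect that per-process limit from Python with the standard-library resource module (this check is my addition, not part of the original answer; the module is not available on Windows). Errno 24 means the soft limit below was exceeded:

```python
# Inspect the per-process open-file limit (Unix-only).
# OSError [Errno 24] "Too many open files" means the soft limit was hit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is often as low as 1024 (or 256 on macOS), which a gather() over 40,000 files blows past almost immediately.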
There are lots of good threads on here about limiting concurrency with asyncio, but the easiest solution is probably asyncio-pool with a reasonable size. Try adding a limit to the number of parallel tasks:
# ...rest of code unchanged

async def write_json(path, limiter):
    async with limiter:
        job = await parse_job(path)
        async with aiofiles.open(path.replace("html", "json"), "w") as f:
            await f.write(json.dumps(job))

async def bulk_read_and_write(files):
    limiter = asyncio.Semaphore(1000)
    tasks = []
    for file in files:
        tasks.append(write_json(file, limiter))
    await asyncio.gather(*tasks)
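To see the semaphore pattern in isolation, here is a minimal, self-contained sketch (the sleep stands in for file I/O, and the cap of 3 is just for demonstration). All tasks are created up front and handed to gather(), but only as many as the semaphore allows ever run the guarded section concurrently:

```python
import asyncio

async def worker(limiter, counter):
    # Only tasks holding a semaphore slot enter this section.
    async with limiter:
        counter['active'] += 1
        counter['peak'] = max(counter['peak'], counter['active'])
        await asyncio.sleep(0.01)  # stand-in for reading/writing a file
        counter['active'] -= 1

async def main():
    limiter = asyncio.Semaphore(3)        # cap concurrency at 3
    counter = {'active': 0, 'peak': 0}
    # 20 tasks are scheduled at once, but at most 3 run the body together.
    await asyncio.gather(*(worker(limiter, counter) for _ in range(20)))
    return counter['peak']

peak = asyncio.run(main())
print(peak)  # 3 — never exceeds the semaphore's value
```

In your code the same thing happens with a cap of 1000: gather() still schedules every write_json coroutine immediately, but the `async with limiter:` line blocks each one until a slot frees up, so no more than 1000 files are ever open at once.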