How to convert lots of HTML to JSON with asyncio in Python

This is my first attempt at using asyncio in Python. The objective is to convert 40,000+ HTML files to JSON. A synchronous for loop takes about 3.5 minutes, and I am interested in seeing the performance boost from asyncio. I am using the following code:

import glob
import json
from parsel import Selector
import asyncio
import aiofiles

async def read_html(path):
    async with aiofiles.open(path, 'r') as f:
        html = await f.read()
    return html


async def parse_job(path):
    html = await read_html(path)
    sel_obj = Selector(html)
    jobs = dict()
    jobs['some_var'] = sel_obj.xpath('some-xpath').get()
    return jobs


async def write_json(path):
    job = await parse_job(path)
    # serialize the dict before writing it out
    async with aiofiles.open(path.replace(".html", ".json"), "w") as f:
        await f.write(json.dumps(job))


async def bulk_read_and_write(files):
    # this function is from a Real Python tutorial.
    # I have little understanding of what's going on with gather()
    tasks = list()
    for file in files:
        tasks.append(write_json(file))
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    files = glob.glob("some_folder_path/*.html")
    asyncio.run(bulk_read_and_write(files))

After a few seconds of running, I get the following error:

Traceback (most recent call last):
  File "06_extract_jobs_async.py", line 84, in <module>
    asyncio.run(bulk_read_and_write(files))
  File "/anaconda3/envs/py37/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/anaconda3/envs/py37/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "06_extract_jobs_async.py", line 78, in bulk_read_and_write
    await asyncio.gather(*tasks)
  File "06_extract_jobs_async.py", line 68, in write_json
    job = await parse_job(path)
  File "06_extract_jobs_async.py", line 35, in parse_job
    html = await read_html(path)
  File "06_extract_jobs_async.py", line 29, in read_html
    async with aiofiles.open(path, 'r') as f:
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/base.py", line 78, in __aenter__
    self._obj = yield from self._coro
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/aiofiles/threadpool/__init__.py", line 35, in _open
    f = yield from loop.run_in_executor(executor, cb)
  File "/anaconda3/envs/py37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
OSError: [Errno 24] Too many open files: '../html_output/jobs/6706538_478752_job.html'

What is going on here? Thanks in advance.

You're making async calls as fast as you can, but writing a file to disk is still an effectively synchronous task. Your OS can try to perform multiple writes at once, but there is a limit. By spawning async tasks as quickly as possible, you get a huge number of files open for writing at the same time, and, as your error shows, you eventually run out of file descriptors.
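As an aside, you can inspect that cap (and, up to the hard limit, raise it) with the standard-library resource module. A minimal, Unix-only sketch:

import resource  # standard library, Unix-only

# The soft limit is the cap the OSError is hitting; an unprivileged
# process may raise it as far as the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# Raising the soft limit buys headroom, but bounding concurrency
# (shown below) is the more robust fix.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))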

As for gather(): it simply schedules all the awaitables you pass it to run concurrently and collects their results, which is why every file ends up open at almost the same time. There are lots of good threads on here about limiting concurrency with asyncio, but the easiest solution is probably asyncio-pool with a reasonable size (a sketch follows the code below).

Try adding a limit to the number of parallel tasks:

# ...rest of code unchanged

async def write_json(path, limiter):
    # asyncio.Semaphore must be entered with "async with", not plain "with"
    async with limiter:
        job = await parse_job(path)
        async with aiofiles.open(path.replace(".html", ".json"), "w") as f:
            await f.write(json.dumps(job))

async def bulk_read_and_write(files):
    # at most 1000 tasks may hold a file open at any one time
    limiter = asyncio.Semaphore(1000)
    tasks = []
    for file in files:
        tasks.append(write_json(file, limiter))
    await asyncio.gather(*tasks)
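If you'd rather not thread a semaphore through every call, the asyncio-pool package mentioned above wraps the same idea. A minimal sketch, assuming its AioPool.map API and the original one-argument write_json(path):

from asyncio_pool import AioPool

async def bulk_read_and_write(files):
    # AioPool keeps at most `size` coroutines in flight, replacing the
    # hand-rolled gather() + Semaphore combination above
    pool = AioPool(size=1000)
    await pool.map(write_json, files)

Either way, the point is the same: the number of simultaneously open files is bounded by the concurrency limit rather than by the length of the task list.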
