
How to merge async generators into a vanilla generator in python 3.5+

I'm having trouble combining async generators and actually running them. This is because the only way I've found to run them is through an event loop, which returns an iterable and not a generator. Let me illustrate this with a simple example:

Let's say I have a function google_search that searches google by scraping it (I'm deliberately not using the API). It takes in a search string and returns a generator of search results. This generator doesn't end when the page is over; the function continues on to the next page. Therefore the google_search function returns a possibly nearly endless generator (it will technically always end, but often you can get millions of hits for a search on google).

def google_search(search_string):
    # Basically uses requests/aiohttp and beautifulsoup
    # to parse the resulting html and yield search results
    # Assume this function works
    ...

Okay, so now I want to make a function that allows me to iterate over multiple google_search generators. I'd like something like this:

def google_searches(*search_strings):
    for results in zip(*(google_search(query) for query in search_strings)):
        yield results

This way I can use a simple for loop to unwind google_searches and get my results. The above code works, but it is very slow for any reasonably big number of searches: it sends a request for the first search, then the second, and so forth, until finally it yields results. I would like to speed this up (a lot). My first idea is to change google_searches to an async function (I am using python 3.6.3 and can use await/async etc). This then creates an async generator, which is fine, but I can only run it in another async function or an event loop. And running it in an event loop with run_until_complete(asyncio.gather(...)) returns a list of results instead of a normal generator, which defeats the purpose, as there are probably way too many search results to hold in a list.

How can I make the google_searches function faster (preferably using async code, but anything is welcome) by executing requests asynchronously, while still having it be a vanilla generator? Thanks in advance!
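
To make the goal concrete, here is a minimal sketch of the kind of sync/async bridge I mean (as_sync_generator is a made-up name, and this naive version assumes nothing else needs the event loop while iterating):

import asyncio

def as_sync_generator(loop, ait):
    # Naive bridge: run the loop just long enough to produce each item,
    # turning an async iterator into a vanilla (synchronous) generator.
    while True:
        try:
            yield loop.run_until_complete(ait.__anext__())
        except StopAsyncIteration:
            return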

The accepted answer waits for one result from EACH async generator before calling the generators again. If data doesn't come at exactly the same pace, that may be a problem. The solution below takes multiple async iterables (generators or not) and iterates all of them simultaneously in multiple coroutines. Each coroutine puts the results in an asyncio.Queue, which is then iterated by the client code:

Iterator code:

import asyncio
from async_timeout import timeout

class MergeAsyncIterator:
    def __init__(self, *it, timeout=60, maxsize=0):
        # Wrap each source iterable in a coroutine that feeds a shared queue.
        self._it = [self.iter_coro(i) for i in it]
        self.timeout = timeout
        self._futures = []
        self._queue = asyncio.Queue(maxsize=maxsize)

    def __aiter__(self):
        # Schedule one producer task per source iterable.
        for it in self._it:
            f = asyncio.ensure_future(it)
            self._futures.append(f)
        return self

    async def __anext__(self):
        # Stop once every producer has finished and the queue is drained.
        if all(f.done() for f in self._futures) and self._queue.empty():
            raise StopAsyncIteration
        # Bail out if no item arrives within self.timeout seconds.
        with timeout(self.timeout):
            try:
                return await self._queue.get()
            except asyncio.CancelledError:
                raise StopAsyncIteration

    def iter_coro(self, it):
        if not hasattr(it, '__aiter__'):
            raise ValueError('Object passed must be an AsyncIterable')
        return self.aiter_to_queue(it)

    async def aiter_to_queue(self, ait):
        # Drain one async iterable into the shared queue, yielding control
        # to the loop after each item so the producers interleave fairly.
        async for i in ait:
            await self._queue.put(i)
            await asyncio.sleep(0)

Sample client code:

import random
import asyncio
from datetime import datetime

async def myaiter(name):
    for i in range(5):
        n = random.randint(0, 3)
        await asyncio.sleep(0.1 + n)
        yield (name, n)
    yield (name, 'DONE')

async def main():
    aiters = [myaiter(i) for i in 'abc']
    async for i in MergeAsyncIterator(*aiters, timeout=3):
        print(datetime.now().strftime('%H:%M:%S.%f'), i)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Output:

14:48:28.638975 ('a', 1)
14:48:29.638822 ('b', 2)
14:48:29.741651 ('b', 0)
14:48:29.742013 ('a', 1)
14:48:30.639588 ('c', 3)
14:48:31.742705 ('c', 1)
14:48:31.847440 ('b', 2)
14:48:31.847828 ('a', 2)
14:48:31.847960 ('c', 0)
14:48:32.950166 ('c', 1)
14:48:33.948791 ('a', 2)
14:48:34.949339 ('b', 3)
14:48:35.055487 ('c', 2)
14:48:35.055928 ('c', 'DONE')
14:48:36.049977 ('a', 2)
14:48:36.050481 ('a', 'DONE')
14:48:37.050415 ('b', 2)
14:48:37.050966 ('b', 'DONE')

PS: The code above uses the async_timeout third-party library.
PS2: The aiostream library does the same as the above code and much more.
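
For completeness, a minimal sketch of merging with aiostream's documented stream.merge operator (myaiter is the toy generator from the sample client code above):

import asyncio
from aiostream import stream

async def main():
    # stream.merge interleaves items from all sources as they arrive
    merged = stream.merge(myaiter('a'), myaiter('b'), myaiter('c'))
    async with merged.stream() as streamer:
        async for item in streamer:
            print(item)

asyncio.get_event_loop().run_until_complete(main())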

def google_search(search_string):
    # Basically uses requests/aiohttp and beautifulsoup

This is a plain synchronous generator. You would be able to use requests inside it, but if you want to use asynchronous aiohttp, you would need an asynchronous generator defined with async def.
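
For illustration, a minimal sketch of what that async generator could look like, assuming aiohttp; parse_results and the query parameters are placeholders, not a working scraper:

import aiohttp

async def google_search(search_string):
    # Placeholder sketch: fetch result pages one after another and
    # yield the parsed hits; parse_results() is a hypothetical helper.
    async with aiohttp.ClientSession() as session:
        page = 0
        while True:
            params = {'q': search_string, 'start': page * 10}
            async with session.get('https://www.google.com/search', params=params) as resp:
                html = await resp.text()
            for result in parse_results(html):
                yield result
            page += 1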

When it comes to iterating over multiple async generators, it's more interesting. You can't use plain zip, since it works with plain iterables, not async iterables. So you should implement your own (one that also supports iterating concurrently).

I made a little prototype that I think does what you want:

import asyncio
import aiohttp
import time


# async versions of some builtins:
async def anext(aiterator):
    # Async analog of next(): advance the iterator by one item.
    return await aiterator.__anext__()


def aiter(aiterable):
    return aiterable.__aiter__()


async def azip(*iterables):
    iterators = [aiter(it) for it in iterables]
    while iterators:
        # Advance all iterators concurrently; exhausted ones come back
        # as StopAsyncIteration instances because of return_exceptions.
        results = await asyncio.gather(
            *[anext(it) for it in iterators],
            return_exceptions=True,
        )
        # Like zip(), stop as soon as any iterator is exhausted.
        if any(isinstance(r, StopAsyncIteration) for r in results):
            return
        yield tuple(results)


# emulating grabbing:
async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()


async def google_search(search_string):
    for i in range(999):  # big async generator
        url = 'http://httpbin.org/delay/{}'.format(i)  # increase delay to better see concurrency
        j = await request(url)
        yield search_string + ' ' + str(i)


async def google_searches(*search_strings):
    async for results in azip(*[google_search(s) for s in search_strings]):
        for result in results:
            yield result


# test it works:
async def main():
    async for result in google_searches('first', 'second', 'third'):
        print(result, int(time.time()))


loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()

Output:

first 0 1514759561
second 0 1514759561
third 0 1514759561
first 1 1514759562
second 1 1514759562
third 1 1514759562
first 2 1514759564
second 2 1514759564
third 2 1514759564
first 3 1514759567
second 3 1514759567
third 3 1514759567

The timestamps show that the different searches run concurrently.

I am just going to paste here the solution I coded a while ago, because I always end up at this question only to remember that I already solved this problem before.

import asyncio
import logging
import typing

logger = logging.getLogger(__name__)


async def iterator_merge(iterators: typing.Dict[typing.AsyncIterator, typing.Optional[asyncio.Future]]):
    while iterators:
        # Schedule a __anext__ call for every iterator without a pending future
        for iterator, value in list(iterators.items()):
            if not value:
                iterators[iterator] = asyncio.ensure_future(iterator.__anext__())

        tasks, _ = await asyncio.wait(iterators.values(), return_when=asyncio.FIRST_COMPLETED)
        for task in tasks:
            # We send the result up
            try:
                res = task.result()
                yield res
            except StopAsyncIteration:
                # We remove the exhausted iterator from the dict
                for it, old_next in list(iterators.items()):
                    if task is old_next:
                        logger.debug(f'Iterator {it} finished consuming')
                        iterators.pop(it)
            else:
                # We clear the future so a new __anext__ gets scheduled
                for it, old_next in list(iterators.items()):
                    if task is old_next:
                        iterators[it] = None

It has typing annotations, but I think it's a good solution. It's meant to be called with your async generators as the dictionary keys, each mapped to a pending future if you already have one to wait on, otherwise None.

iterators = {
    k8s_stream_pod_log(name=name): None,
    k8s_stream_pod_events(name=name): None,
}
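
Consuming the merged stream is then a plain async for loop; a minimal usage sketch with the iterators dict above:

async def main():
    async for event in iterator_merge(iterators):
        print(event)

asyncio.get_event_loop().run_until_complete(main())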

You can see how I use it at github.com/txomon/abot.
