Asyncio Pandas with Inplace

I just read this introduction, but am having trouble implementing either of the examples (the commented-out code being the second example):

import asyncio
import pandas as pd
from openpyxl import load_workbook

async def loop_dfs(dfs):
    async def clean_df(df):
        df.drop(["column_1"], axis=1, inplace=True)
        # ... a bunch of other inplace=True functions ...
        return "Done"

    # tasks = [clean_df(df) for (table, df) in dfs.items()]
    # await asyncio.gather(*tasks)

    tasks = [clean_df(df) for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)


def main():
    dfs = {
        sn: pd.read_excel("excel.xlsx", sheet_name=sn)
        for sn in load_workbook("excel.xlsx").sheetnames
    }

    # loop = asyncio.get_event_loop()
    # loop.run_until_complete(loop_dfs(dfs))

    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(loop_dfs(dfs))
    finally:
        loop.close()

main()

I saw a few other posts about how pandas doesn't support asyncio, and maybe I'm just missing the bigger picture, but that shouldn't matter if I'm doing inplace operations, right? I saw recommendations for Dask, but since it has no immediate support for reading Excel, I figured I'd try this first. However, I keep getting:

RuntimeError: Event loop already running

I saw a few other posts about how pandas doesn't support asyncio, and maybe I'm just missing the bigger picture, but that shouldn't matter if I'm doing inplace operations, right?

In-place operations are those that modify existing data. That is a matter of efficiency, whereas your goal appears to be parallelization, an entirely different matter.

Pandas doesn't support asyncio not only because this wasn't yet implemented, but because Pandas doesn't typically do the kind of operations that asyncio supports well: network and subprocess IO. Pandas functions either use the CPU or wait for disk access, neither of which is a good fit for asyncio. Asyncio allows network communication to be expressed with coroutines that look like ordinary synchronous code. Inside a coroutine each blocking operation (e.g. a network read) is awaited, which automatically suspends the whole task if the data is not yet available. At each such suspension the system switches to the next task, effectively creating a cooperative multi-tasking system.

When trying to call a library that doesn't support asyncio, such as pandas, things will superficially appear to work, but you won't get any benefit and the code will run serially. For example:

async def loop_dfs(dfs):
    async def clean_df(df):
        ...    
    tasks = [clean_df(df) for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)

Since clean_df doesn't contain a single instance of await, it is a coroutine in name only: it will never actually suspend its execution to allow other coroutines to run. Thus await asyncio.wait(tasks) will run the tasks in series, as if you wrote:

for table, df in dfs.items():
    clean_df(df)
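
The difference between a coroutine that never awaits and one that does can be seen in a small stand-alone sketch. The worker names and the log list below are hypothetical and only illustrate scheduling order; nothing here touches pandas. The sketch uses asyncio.gather rather than asyncio.wait, because asyncio.wait no longer accepts bare coroutines as of Python 3.11.

```python
import asyncio

async def blocking_worker(name, log):
    # No await inside: once started, this runs to completion without yielding,
    # just like a clean_df coroutine that only calls pandas functions.
    for step in range(2):
        log.append(f"{name}:{step}")

async def cooperative_worker(name, log):
    for step in range(2):
        log.append(f"{name}:{step}")
        await asyncio.sleep(0)  # suspension point: lets other tasks run

async def run_pair(worker):
    # Run two workers "concurrently" and record the order of their steps.
    log = []
    await asyncio.gather(worker("a", log), worker("b", log))
    return log

serial = asyncio.run(run_pair(blocking_worker))
interleaved = asyncio.run(run_pair(cooperative_worker))
print(serial)       # ['a:0', 'a:1', 'b:0', 'b:1']
print(interleaved)  # ['a:0', 'b:0', 'a:1', 'b:1']
```

Only the worker with a real suspension point interleaves with its sibling; the other one finishes completely before the next one starts.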

To get things to run in parallel (provided pandas occasionally releases the GIL during its operations), you should hand off the individual CPU-bound functions to a thread pool:

async def loop_dfs(dfs):
    def clean_df(df):  # note: ordinary def
        ...
    loop = asyncio.get_event_loop()
    # The first argument is the executor; None selects the loop's default thread pool.
    tasks = [loop.run_in_executor(None, clean_df, df)
             for (table, df) in dfs.items()]
    completed, pending = await asyncio.wait(tasks)
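
As a runnable illustration of the thread-pool hand-off, here is a minimal sketch that replaces the pandas work with a hypothetical clean function. It uses time.sleep as a stand-in for a blocking call that releases the GIL; the sheet names are made up. Real pandas operations may or may not release the GIL, so the speed-up you see in practice can differ.

```python
import asyncio
import threading
import time

def clean(name):
    # Stand-in for a blocking clean_df call; time.sleep releases the GIL,
    # so several of these can overlap in different threads.
    time.sleep(0.05)
    return name, threading.current_thread().name

async def clean_all(names):
    loop = asyncio.get_running_loop()
    # None selects the event loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, clean, n) for n in names]
    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*futures)

names = ["Sheet1", "Sheet2", "Sheet3"]
results = asyncio.run(clean_all(names))
for name, thread_name in results:
    print(name, "cleaned on", thread_name)
```

Each call runs on a worker thread rather than the main thread, while the event loop itself stays free to service other coroutines.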

If you go down that route, you don't need asyncio in the first place; you can simply use concurrent.futures. For example:

import concurrent.futures

def loop_dfs(dfs):  # note: ordinary def
    def clean_df(df):
        ...
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(clean_df, df)
                   for (table, df) in dfs.items()]
        concurrent.futures.wait(futures)
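
One caveat with this pattern: concurrent.futures.wait only waits, so an exception raised inside a worker stays hidden until you call result() on its future. The sketch below shows one way to surface such failures. It uses plain dicts as hypothetical stand-ins for DataFrames, and clean_table is an invented helper that mimics an in-place drop of column_1.

```python
import concurrent.futures

def clean_table(table):
    # Stand-in for clean_df: mutates its argument in place, like
    # df.drop(..., inplace=True), and raises KeyError if the column is missing.
    del table["column_1"]

def clean_all(tables):
    """Clean every table in a thread pool; return {name: exception} for failures."""
    failures = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(clean_table, t): name
                   for name, t in tables.items()}
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()  # re-raises anything clean_table raised
            except KeyError as exc:
                failures[futures[future]] = exc
    return failures

tables = {
    "good": {"column_1": [1, 2], "column_2": [3, 4]},
    "bad": {"column_2": [5, 6]},  # no column_1, so cleaning fails
}
failures = clean_all(tables)
print(sorted(failures))  # ['bad']
print(tables["good"])    # {'column_2': [3, 4]}
```

Because the workers mutate their arguments in place, the successfully cleaned tables are already updated in the original dict once the pool exits.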

figured I'd try this first but I keep getting RuntimeError: Event loop already running

That error typically means that you've started the script in an environment that already runs asyncio for you, such as a Jupyter notebook. If that is the case, make sure that you run your script with stock python, or consult your notebook's documentation on how to change your code to submit the coroutines to the event loop that is already running.
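
To check which situation you are in, you can ask asyncio whether an event loop is already running in the current thread. The helper name in_running_loop below is hypothetical; asyncio.get_running_loop is the standard call and raises RuntimeError when no loop is running.

```python
import asyncio

def in_running_loop():
    """Return True when called from code that an event loop is already driving."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True

async def check():
    # Inside a coroutine executed by asyncio.run, a loop is running.
    return in_running_loop()

outside = in_running_loop()    # called from plain synchronous code
inside = asyncio.run(check())  # called from within a running loop
print(outside, inside)
```

If the helper reports a running loop at the top level of your notebook, calling run_until_complete will fail. Recent IPython and Jupyter versions generally let you write await loop_dfs(dfs) directly in a cell instead, though you should confirm this for your setup.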
