Python asyncio runs slower

I'm new to Python, parallel execution, and async. Am I doing something wrong? My code runs slower than (or at best equal to) the time it takes to run the script the traditional way, without asyncio.
import asyncio, os, time, pandas as pd

start_time = time.time()

async def main():
    coroutines = list()
    for root, dirs, files in os.walk('.', topdown=True):
        for file in files:
            coroutines.append(cleaner(file))
    await asyncio.gather(*coroutines)

async def cleaner(file):
    df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
    df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
    df[['email', 'data']].to_csv('x1', sep=':', index=False, header=False, mode='a', compression='gzip')

asyncio.run(main())
print("--- %s seconds ---" % (time.time() - start_time))
Your workload appears to be read file --> process with pandas --> write file. That is an ideal fit for multiprocessing, because each work item is completely independent. Like any blocking operation, pandas routines that read from and write to the filesystem are not good candidates for asyncio, unless you run them in asyncio's thread or process pools.
Instead, these multiple operations are good candidates for true parallel execution, which asyncio does not give you. (Its thread and process pools are also decent options.)
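For completeness, a minimal sketch of the thread-pool route mentioned above, using `asyncio.to_thread` (Python 3.9+). The `clean` function here is a hypothetical stand-in for the blocking pandas read/transform/write step:

```python
import asyncio

def clean(path):
    # stand-in for the blocking pandas read/transform/write step;
    # because it runs in a worker thread, it does not block the event loop
    return path + ":done"

async def main(paths):
    # each blocking call is handed to a thread-pool worker, and
    # gather() waits for all of them concurrently
    return await asyncio.gather(*(asyncio.to_thread(clean, p) for p in paths))
```

This keeps the asyncio structure of the original script but actually overlaps the blocking work; for CPU-heavy pandas transforms, a process pool (as below) parallelizes better because of the GIL.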
import multiprocessing as mp
import os

def walk_all_files(path):
    for root, dirs, files in os.walk(path, topdown=True):
        for file in files:
            yield os.path.join(root, file)

def cleaner(path):
    return "sparkly"

def clean_all(path="."):
    files = list(walk_all_files(path))
    # using cpu*2 assuming that there is a lot of cpu heavy
    # work that can be done by some processes while others
    # wait on I/O. This is only a guess.
    cpu_count = min(len(files), mp.cpu_count()*2)
    with mp.Pool(cpu_count) as pool:
        # assuming processing is fairly long but also kind of random depending on
        # file contents, setting chunksize to 1 so that each subprocess gets a new
        # work item from the parent on each round. You could set it higher to have
        # fewer interactions between parent and worker.
        result = pool.map(cleaner, files, chunksize=1)

if __name__ == "__main__":
    clean_all(".")
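If per-file cost varies a lot, `Pool.imap_unordered` is a variant worth knowing: it hands back results as workers finish rather than in submission order, so slow files don't delay fast ones. A minimal sketch, where `cleaner` is again a stand-in for the real pandas step:

```python
import multiprocessing as mp

def cleaner(path):
    # hypothetical stand-in for the real per-file cleaning work
    return path.upper()

def clean_all(paths, chunksize=1):
    # imap_unordered yields each result as soon as its worker finishes;
    # chunksize=1 keeps work distribution even when item costs vary
    with mp.Pool(min(len(paths), mp.cpu_count())) as pool:
        return sorted(pool.imap_unordered(cleaner, paths, chunksize=chunksize))

if __name__ == "__main__":
    print(clean_all(["a.csv", "b.csv"]))
```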