Python asyncio runs slower

I'm new to Python, parallel execution, and async. Am I doing something wrong? My code runs slower than (or at best equal to) the time it takes to run the script the traditional way, without asyncio.
import asyncio, os, time, pandas as pd

start_time = time.time()

async def main():
    coroutines = list()
    for root, dirs, files in os.walk('.', topdown=True):
        for file in files:
            coroutines.append(cleaner(file))
    await asyncio.gather(*coroutines)

async def cleaner(file):
    df = pd.read_csv(file, sep='\n', header=None, engine='python', quoting=3)
    df = df[0].str.strip(' \t"').str.split('[,|;: \t]+', 1, expand=True).rename(columns={0: 'email', 1: 'data'})
    df[['email', 'data']].to_csv('x1', sep=':', index=False, header=False, mode='a', compression='gzip')

asyncio.run(main())
print("--- %s seconds ---" % (time.time() - start_time))
Your workload appears to be read file --> process with pandas --> write file. That is an ideal fit for multiprocessing, because each work item is completely independent. Like any blocking operation, pandas routines that read from and write to the filesystem are not good candidates for asyncio, unless you run them in asyncio's thread or process pools.
Instead, these multiple operations are good candidates for true parallel execution, which asyncio does not give you. (Its thread and process pools are also decent options.)
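For completeness, a minimal sketch of the thread-pool route mentioned above, using `asyncio.to_thread` (Python 3.9+). The `clean` function here is a hypothetical stand-in for the blocking pandas read/transform/write step:

```python
import asyncio

def clean(path):
    # stand-in for the blocking pandas read/transform/write step;
    # because it runs in a worker thread, it does not block the event loop
    return path + ":done"

async def main(paths):
    # each blocking call is handed to a thread-pool worker, and
    # gather() waits for all of them concurrently
    return await asyncio.gather(*(asyncio.to_thread(clean, p) for p in paths))
```

This keeps the asyncio structure of the original script but actually overlaps the blocking work; for CPU-heavy pandas transforms, a process pool (as below) parallelizes better because of the GIL.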
import multiprocessing as mp
import os

def walk_all_files(path):
    for root, dirs, files in os.walk(path, topdown=True):
        for file in files:
            yield os.path.join(root, file)

def cleaner(path):
    return "sparkly"

def clean_all(path="."):
    files = list(walk_all_files(path))
    # using cpu*2 assuming that there is a lot of cpu heavy
    # work that can be done by some processes while others
    # wait on I/O. This is only a guess.
    cpu_count = min(len(files), mp.cpu_count()*2)
    with mp.Pool(cpu_count) as pool:
        # assuming processing is fairly long but also kind of random depending on
        # file contents, setting chunksize to 1 so that each subprocess gets a new
        # work item from the parent on each round. You could set it higher to have
        # fewer interactions between parent and worker.
        result = pool.map(cleaner, files, chunksize=1)

if __name__ == "__main__":
    clean_all(".")
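If per-file cost varies a lot, `Pool.imap_unordered` is a variant worth knowing: it hands back results as workers finish rather than in submission order, so slow files don't delay fast ones. A minimal sketch, where `cleaner` is again a stand-in for the real pandas step:

```python
import multiprocessing as mp

def cleaner(path):
    # hypothetical stand-in for the real per-file cleaning work
    return path.upper()

def clean_all(paths, chunksize=1):
    # imap_unordered yields each result as soon as its worker finishes;
    # chunksize=1 keeps work distribution even when item costs vary
    with mp.Pool(min(len(paths), mp.cpu_count())) as pool:
        return sorted(pool.imap_unordered(cleaner, paths, chunksize=chunksize))

if __name__ == "__main__":
    print(clean_all(["a.csv", "b.csv"]))
```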