Why does my code take so long to write CSV file in Dask Python
Below is my Python code:
import dask.dataframe as dd
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv', usecols = VALUEFY, dtype = traintypes1)
REPORT = VALUE2015.groupby(index).agg({'JAN':'sum', 'FEB':'sum', 'MAR':'sum', 'APR':'sum', 'MAY':'sum','JUN':'sum', 'JUL':'sum', 'AUG':'sum', 'SEP':'sum', 'OCT':'sum', 'NOV':'sum', 'DEC':'sum'}).compute()
REPORT.to_csv('VALUE*.csv', header=True)
It takes 6 minutes to create a 100MB CSV file.
Looking through the Dask documentation, it says that, "generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations." So unless you're using a Dask distributed client to manage workers, threads, etc., the benefit of using it over vanilla Pandas isn't always there.
Also, try to time each step in your code, because if the bulk of the 6 minutes is spent writing the CSV to disk, then again Dask will be of no help (for a single file).
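A minimal way to do that timing is with a small stdlib helper; the commented usage lines below mirror the three stages of your pipeline (read, groupby/agg, write), with your original variable names kept as placeholders:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Applied to the pipeline above, e.g.:
# VALUE2015 = timed("read_csv", dd.read_csv, path, usecols=VALUEFY, dtype=traintypes1)
# REPORT    = timed("groupby+agg", lambda: VALUE2015.groupby(index).agg(aggs).compute())
# timed("to_csv", REPORT.to_csv, "VALUE.csv", header=True)
```

Whichever stage dominates tells you whether Dask (or the disk) is the bottleneck.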
Here's a nice tutorial from Dask on adding distributed schedulers for your tasks.
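As a sketch of what that tutorial covers: attaching a local distributed scheduler only takes a few lines (this assumes the optional `dask.distributed` package is installed, and the worker/thread counts are illustrative, not tuned values):

```python
def make_local_client(n_workers=4, threads_per_worker=2):
    """Return a local dask.distributed Client, or None if the
    distributed extra is not installed (Dask then keeps using
    its default scheduler)."""
    try:
        from dask.distributed import Client
    except ImportError:
        return None
    # Creating a Client with no address spins up a local cluster
    # and makes it the default scheduler for subsequent .compute() calls.
    return Client(n_workers=n_workers, threads_per_worker=threads_per_worker)

# Usage (commented out so the sketch stays side-effect free):
# client = make_local_client()
# ... run dd.read_csv(...).groupby(...).agg(...).compute() ...
# if client is not None:
#     client.close()
```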