Why does my code take so long to write CSV file in Dask Python
Below is my Python code:
import dask.dataframe as dd
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv', usecols = VALUEFY, dtype = traintypes1)
REPORT = VALUE2015.groupby(index).agg({'JAN':'sum', 'FEB':'sum', 'MAR':'sum', 'APR':'sum', 'MAY':'sum','JUN':'sum', 'JUL':'sum', 'AUG':'sum', 'SEP':'sum', 'OCT':'sum', 'NOV':'sum', 'DEC':'sum'}).compute()
REPORT.to_csv('VALUE*.csv', header=True)
It takes 6 minutes to create a 100MB CSV file.
Looking through the Dask documentation, it says that, "generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations." So unless you're using a Dask distributed client to manage workers, threads, etc., the benefit of using it over vanilla Pandas isn't always there.
Also, try to time each step in your code, because if the bulk of the 6 minutes is spent writing the CSV to disk, then again Dask will be of no help (for a single file).
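A minimal way to do that timing is with a small stdlib helper; the commented usage lines below mirror the three stages of your pipeline (read, groupby/agg, write), with your original variable names kept as placeholders:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Applied to the pipeline above, e.g.:
# VALUE2015 = timed("read_csv", dd.read_csv, path, usecols=VALUEFY, dtype=traintypes1)
# REPORT    = timed("groupby+agg", lambda: VALUE2015.groupby(index).agg(aggs).compute())
# timed("to_csv", REPORT.to_csv, "VALUE.csv", header=True)
```

Whichever stage dominates tells you whether Dask (or the disk) is the bottleneck.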
Here's a nice tutorial from Dask on adding distributed schedulers for your tasks.
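As a sketch of what that tutorial covers: attaching a local distributed scheduler only takes a few lines (this assumes the optional `dask.distributed` package is installed, and the worker/thread counts are illustrative, not tuned values):

```python
def make_local_client(n_workers=4, threads_per_worker=2):
    """Return a local dask.distributed Client, or None if the
    distributed extra is not installed (Dask then keeps using
    its default scheduler)."""
    try:
        from dask.distributed import Client
    except ImportError:
        return None
    # Creating a Client with no address spins up a local cluster
    # and makes it the default scheduler for subsequent .compute() calls.
    return Client(n_workers=n_workers, threads_per_worker=threads_per_worker)

# Usage (commented out so the sketch stays side-effect free):
# client = make_local_client()
# ... run dd.read_csv(...).groupby(...).agg(...).compute() ...
# if client is not None:
#     client.close()
```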