
Writing large dask dataframe to file

I have a large BCP file (12GB) that I have imported into dask and done some data wrangling on, which I now wish to import into SQL Server. The file has been reduced from 40+ columns to 8 columns, and I wish to find the best method to import it into SQL Server. I have tried using the following:

import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urllib.parse import quote_plus

pbar = ProgressBar()
pbar.register()
# ddf is the dask dataframe produced by the earlier wrangling
# Windows authentication
#to_sql_uri = quote_plus(engine)
ddf.to_sql('test',
           uri='mssql+pyodbc://TEST_SERVER/TEST_DB?driver=SQL+Server&Trusted_Connection=yes',
           if_exists='replace', index=False)
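
For reference, the unused quote_plus import points at the other common way to build this kind of connection URI: URL-encode a full ODBC connection string and pass it through odbc_connect. A minimal sketch, assuming the same TEST_SERVER/TEST_DB names and the generic "SQL Server" ODBC driver:

from urllib.parse import quote_plus

# Encode the raw ODBC connection string so the spaces, braces and '=' signs
# survive inside the SQLAlchemy URI
params = quote_plus(
    "DRIVER={SQL Server};SERVER=TEST_SERVER;DATABASE=TEST_DB;Trusted_Connection=yes"
)
uri = "mssql+pyodbc:///?odbc_connect=" + params

ddf.to_sql('test', uri=uri, if_exists='replace', index=False)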

This method is taking too long (3 days and counting). I had suspected this might be the case, so I also tried to write to a BCP file with the intention of using SQL BCP, but again this is taking a number of days:

df_train_grouped.compute().to_csv(r"F:\TEST_FILE.bcp", sep='\t')

I am relatively new to dask and can't seem to find an easy-to-follow example of the most efficient method to do this.

There is no need for you to use compute; this materialises the dataframe into memory and is likely the bottleneck for you. You can instead do

df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t')

which will create a number of output files in parallel - which is probably exactly what you want.
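
The * in the path is replaced by the partition number by default; if you want different file names, to_csv also accepts a name_function that maps the partition index to the text substituted for the *. A small sketch (the zero-padded naming scheme is just an illustration):

# One tab-separated file per partition:
# F:\TEST_FILE_part0000.bcp, F:\TEST_FILE_part0001.bcp, ...
paths = df_train_grouped.to_csv(
    r"F:\TEST_FILE_*.bcp",
    sep='\t',
    name_function=lambda i: f"part{i:04d}",
)
print(paths[:3])  # to_csv returns the list of files it wrote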

Note that profiling will determine whether your process is IO bound (e.g., by the disc itself), in which case there is nothing you can do, or whether one of the process-based schedulers (ideally the distributed scheduler) can help with GIL-holding tasks.
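
For example, one quick way to check is to run the same write under the distributed scheduler, whose dashboard shows whether workers are mostly waiting on the disc or busy with GIL-holding work. A minimal sketch, assuming dask.distributed is installed (the worker counts are arbitrary):

from dask.distributed import Client

# Local cluster of worker processes; the printed dashboard link shows
# per-task timings, making IO-bound vs CPU/GIL-bound work easy to spot.
client = Client(n_workers=4, threads_per_worker=1)

df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t')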

Changing to a multiprocessing scheduler as follows improved performance in this particular case:

import dask

dask.config.set(scheduler='processes')  # overwrite default with multiprocessing scheduler
df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t', chunksize=1000000)
