Dask to_csv generates inaccessible file

I'm new to Dask. My motivation was to read large CSV files faster by parallelizing the process. After reading a file, I use compute() in order to merge the parts into a single pandas df. Then, when using pandas to_csv, the output CSV file isn't readable:

$ file -I *.csv
my_big_file.csv:               ERROR: cannot read `my_big_file.csv' (Operation canceled)
$ head -n2 my_big_file.csv
head: Error reading my_big_file.csv

Original code looks like the following:

import pandas as pd
import dask.dataframe as daf

filepath = '/Users/coolboy/Customer Data/my_original_file.csv'
# read in parallel with Dask, then collapse the partitions into one pandas DataFrame
df = daf.read_csv(filepath, dtype=str, low_memory=False, encoding='utf-8-sig',
                  error_bad_lines=False).compute()
print('done reading')
df.to_csv('/Users/coolboy/Customer Data/my_big_file.csv', index=False)

The original motivation is to read the data into memory faster. Using dask is a plausible solution, but if the intention is simply to bring the data into memory, then other alternatives are available as well. For example, modin follows the pandas API and can deliver a runtime reduction roughly proportional to the number of cores, see docs. The code would roughly look like this:

import modin.pandas as pd

# path_in/path_out and the **options dicts are placeholders for your own values
df_in = pd.read_csv(path_in, **options)
... # potentially some additional logic
df_out.to_csv(path_out, **other_options)

If speed/memory efficiency is of primary concern and there is no data transformation happening, then the best alternative is to use shell commands or Python-based libraries: copy the file with pathlib or, if remote data is involved, with fsspec.
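As a rough illustration (not from the original answer), a pure byte copy of the question's file could look like the sketch below; shutil does the actual copying, and the s3:// URL in the remote branch is only a placeholder:

import shutil
from pathlib import Path

import fsspec  # only needed when the source lives on a remote filesystem

src = Path('/Users/coolboy/Customer Data/my_original_file.csv')
dst = Path('/Users/coolboy/Customer Data/my_big_file.csv')

# local-to-local: a straight byte-for-byte copy, no CSV parsing at all
shutil.copyfile(src, dst)

# remote-to-local (the s3:// URL is a placeholder): stream the bytes with fsspec
with fsspec.open('s3://my-bucket/my_original_file.csv', 'rb') as remote, open(dst, 'wb') as local:
    shutil.copyfileobj(remote, local)

Either route avoids building a DataFrame at all, which is why it is faster and lighter on memory than round-tripping the data through dask or pandas.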
