
Handling large, compressed CSV files with Dask

The setup: I have eight large CSV files (32 GB each), which are zip-compressed down to 8 GB each. I cannot work with the uncompressed data because I want to save disk space and do not have 8*32 GB of space left. I also cannot load a single file with e.g. pandas, because it does not fit into memory.

I thought Dask would be a reasonable choice for this task, but feel free to suggest a different tool if it suits the purpose better.

Is it possible to process one 8 GB compressed file with Dask by reading multiple chunks of the compressed file in parallel, processing each chunk, and saving the results to disk?

The first problem is that Dask does not support .zip. This issue proposes using dask.delayed as a workaround, but it would also be possible for me to change the format to .xz or something else.
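For illustration, here is a minimal sketch of what a dask.delayed-based workaround might look like. It is my own assumption of the pattern, not the exact proposal from the issue; the file name, column names, chunk size and chunk count are placeholders, and it assumes each zip archive contains a single CSV. Because zip is not seekable, every task decompresses the stream from the beginning up to its slice, so only the parsing and downstream processing are parallelised, not the decompression itself.

import dask
import dask.dataframe as dd
import pandas as pd

ROWS_PER_CHUNK = 1_000_000    # assumed chunk size, tune to available memory
N_CHUNKS = 250                # assumed number of chunks in one file
COLUMNS = ["col_a", "col_b"]  # placeholder column names

@dask.delayed
def read_chunk(path, i):
    # pandas can read a zip archive directly if it contains exactly one CSV;
    # skiprows/nrows select one slice, but the stream is decompressed from
    # the start for every task, so this is not real random access
    return pd.read_csv(
        path,
        compression="zip",
        names=COLUMNS,
        header=None,
        skiprows=1 + i * ROWS_PER_CHUNK,  # +1 skips the header line
        nrows=ROWS_PER_CHUNK,
    )

parts = [read_chunk("file_1.csv.zip", i) for i in range(N_CHUNKS)]
df = dd.from_delayed(parts)  # Dask may compute one partition to infer metadata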

Second, and probably related to the choice of compression format, is the question of whether it is possible to access only parts of the compressed file in parallel.

Or is it better to split each uncompressed CSV file into smaller parts that fit into memory and then process the re-compressed smaller parts with something like this:

import dask.dataframe as dd

# read all re-compressed parts as one Dask dataframe
df = dd.read_csv('files_*.csv.xz', compression='xz')

For now, I would prefer something along the lines of the first solution, which seems leaner, but I might be totally mistaken as this domain is new to me.

Thanks for your help!

The easiest solution is certainly to stream each of your large files into several smaller compressed files (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file becomes one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory while processing the data with Dask.
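For reference, a minimal sketch of that splitting step, assuming each zip archive contains a single CSV member; the file names, the lines-per-part value and the choice of gzip for the parts are placeholders to adapt:

import gzip
import zipfile

import dask.dataframe as dd

LINES_PER_PART = 5_000_000   # assumed; tune so one part fits comfortably in memory

with zipfile.ZipFile("file_1.csv.zip") as zf:
    member = zf.namelist()[0]            # assumes a single CSV inside the archive
    with zf.open(member) as src:
        header = src.readline()
        part, out, n = 0, None, 0
        for line in src:
            if out is None:
                out = gzip.open(f"file_1_part_{part:04d}.csv.gz", "wb")
                out.write(header)        # repeat the header in every part
            out.write(line)
            n += 1
            if n == LINES_PER_PART:      # each part ends on a newline
                out.close()
                out, n = None, 0
                part += 1
        if out is not None:
            out.close()

# gzip is not splittable either, so blocksize=None makes each part one partition
df = dd.read_csv("file_1_part_*.csv.gz", compression="gzip", blocksize=None)

Because the splitting is streamed line by line and re-compressed on the fly, the full 32 GB file is never held in memory or written uncompressed to disk.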

The fundamental reason here is that a format like bz2, gz or zip does not allow random access: the only way to read the data is from the start of the stream. xz is the only format that allows block-wise compression within a file, so, in principle, it would be possible to load block-wise, which is not quite the same as real random access but would do what you are after. However, this pattern is really the same as having separate files, so it is not worth the extra effort of writing the files in blocking mode (not the default) and using the functions dask.bytes.compression.get_xz_blocks and xz_decompress, which are not currently used for anything in the codebase.
