
Why is Dask to_csv saving files in parts?

Preamble for context:

I have a sample csv file that has more columns than rows (~300 vs. 190), and I'm trying to learn how it all works before processing the full 80 million records. I'm working in a Google Colab notebook.

What I'm trying to do:

Read a CSV file, execute value_counts() on all the columns, and save the results.

Here's the code; I left it as is:

import dask.dataframe as dd
import pandas as pd

# Here we're reading the csv
dfd = dd.read_csv(
    'drive/MyDrive/csvs/sample.csv',
    delimiter=';',

    # Down below we specify the types of the first columns
    dtype={'ID': object, 'BSID': 'UInt32', 'CAM': 'UInt32',
           'AGZ': 'UInt32', 'Zen': 'UInt16', 'taw': 'UInt16'},
    blocksize=64000000  # = 64 MB chunks
)

# Here we convert the rest of the ~300 columns to UInt8

cols = [i for i in dfd.columns if i not in ['ID', 'BSID', 'CAM',
                                            'AGZ', 'Zen', 'taw']]
for col in cols:
    dfd[col] = dfd[col].astype('UInt8')

# value_counts
for col in dfd.columns:
    result = dfd[col].value_counts()
    result.to_csv('drive/MyDrive/csvs/Value_counts-' + col + '.csv')

What's going wrong:

When the code is executed, the results are stored as files named 0.part inside folders that follow the Value_counts-' + col + '.csv naming schema. I expected them to be saved as Value_counts-' + col + '.csv files in the csvs folder.

Why is this happening?

Additional question:

Can I run value_counts() for all columns in a better way?

See the doc:

single_file : bool, default False
    Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file.

In your case you only have one partition (0.part) for each output, but Dask doesn't know that you don't need parallel writing from multiple chunks, so you need to help it.
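
A sketch of the fix under that reading: keep the question's loop, but pass single_file=True so each result is written as one CSV file instead of a folder of part files.

# Same loop as in the question, but single_file=True tells Dask to append
# all partitions into one CSV file instead of creating a folder of parts.
for col in dfd.columns:
    result = dfd[col].value_counts()
    result.to_csv(
        'drive/MyDrive/csvs/Value_counts-' + col + '.csv',
        single_file=True
    )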

Is there a better way? Well, it sounds like you have many more columns than partitions, so you could do dfd.map_partitions(pd.DataFrame.value_counts) and sum the pieces.
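
A minimal sketch of that idea, assuming per-column counts are what you want (note that pandas' DataFrame.value_counts counts unique rows, so the hypothetical helper below applies Series.value_counts column by column instead):

import pandas as pd

# column_value_counts is a hypothetical helper, not part of Dask or pandas.
def column_value_counts(df):
    # Run value_counts per column within one partition; the result is a
    # Series indexed by (column name, value) pairs.
    return pd.concat({col: df[col].value_counts() for col in df.columns})

# Count within each partition in parallel, then sum the pieces.
partial = dfd.map_partitions(column_value_counts).compute()
totals = partial.groupby(level=[0, 1]).sum()

# Write one file per column with plain pandas (a single file by default).
for col in dfd.columns:
    totals.loc[col].to_csv('drive/MyDrive/csvs/Value_counts-' + col + '.csv')

If Dask cannot infer the output shape of the helper, you may need to pass an explicit meta= argument to map_partitions.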
