
Reading large CSV files using delayed (DASK)

I'm using delayed to read many large CSV files:

import pandas as pd

def function_1(x1, x2):         
    df_d1 = pd.read_csv(x1)
    # Some calculations on df_d1 using x2.
    return df_d1

def function_2(x3):         
    df_d2 = pd.read_csv(x3)
    return df_d2

def function_3(df_d1, df_d2):         
    # some calculations and merging data-sets (output is "merged_ds").
    return merged_ds
  • function_1: importing data-set 1 and doing some calculations.
  • function_2: importing data-set 2.
  • function_3: merging the data-sets and doing some calculations.

Next, I use a loop to call these functions with dask.delayed. I have many CSV files, and every file is more than 500 MB. Is this a suitable procedure for my tasks using Dask (delayed)?
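
Roughly, the loop looks like this (the file lists, the extra x2 value, and the final dask.compute call are illustrative placeholders, not my real paths or parameters):

import dask
from dask import delayed

# Illustrative placeholders: the real lists contain many 500 MB+ CSV files.
files_1 = ["data1_part1.csv", "data1_part2.csv"]
files_2 = ["data2_part1.csv", "data2_part2.csv"]
x2 = None  # placeholder for whatever function_1 actually needs

results = []
for x1, x3 in zip(files_1, files_2):
    df_d1 = delayed(function_1)(x1, x2)
    df_d2 = delayed(function_2)(x3)
    merged_ds = delayed(function_3)(df_d1, df_d2)
    results.append(merged_ds)

# Nothing is read until compute is called; Dask then runs the whole task graph.
outputs = dask.compute(*results)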

Yes, please go ahead and delay your functions and submit them to Dask. The most memory-heavy step is likely to be function_3, and you may want to consider how many of these you can hold in memory at a time. Use the distributed scheduler to control how many workers and threads you have, and their respective memory limits: https://distributed.readthedocs.io/en/latest/local-cluster.html
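
A minimal sketch of such a setup, assuming a LocalCluster on a single machine (the worker count, thread count, and memory limit below are illustrative, not recommendations):

from dask.distributed import Client, LocalCluster

# Illustrative: a few single-threaded workers, each capped at 4 GB, so only
# a handful of the memory-heavy merges are in flight at any one time.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

# Any delayed graphs computed after this point run on this cluster by default.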

Finally, I'm sure you do not want to return the final merged dataframes, which surely do not fit in memory: you probably mean to aggregate over them or write them out to other files.
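
A sketch of that pattern, assuming a variant of function_3 that writes each merged result to Parquet and returns only a small aggregate (the merge key, output paths, and the d1_tasks/d2_tasks lists of delayed inputs are illustrative):

import dask
from dask import delayed

def function_3_to_disk(df_d1, df_d2, out_path):
    # Illustrative merge; the real key and calculations will differ.
    merged_ds = df_d1.merge(df_d2, on="key")
    # Persist the heavy frame instead of returning it...
    merged_ds.to_parquet(out_path)
    # ...and hand back only a lightweight summary.
    return len(merged_ds)

# d1_tasks / d2_tasks stand for the delayed df_d1 / df_d2 objects built in the loop.
write_tasks = [
    delayed(function_3_to_disk)(d1, d2, f"merged_{i}.parquet")
    for i, (d1, d2) in enumerate(zip(d1_tasks, d2_tasks))
]
row_counts = dask.compute(*write_tasks)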
