
dask.read_parquet causes OOM Error

I have been using Dask to perform data cleansing on multiple CSV files. This code works fine:

import pandas as pd
import glob
import os
from timeit import default_timer
from dask.distributed import Client
import dask.dataframe as dd

cols_to_keep = ["barcode", "salesdate", "storecode", "quantity", "salesvalue", "promotion", "key_row"]

col_types = {'barcode': object,
             'salesdate': object,
             'storecode': object,
             'quantity': float,
             'salesvalue': float,
             'promotion': object,
             'key_row': object}

trans = dd.read_csv(os.path.join(TRANS_PATH, "*.TXT"), 
                    sep=";", usecols=cols_to_keep, dtype=col_types, parse_dates=['salesdate'])

trans = trans[trans['barcode'].isin(barcodes)]

trans_df = trans.compute()

I decided to try out the Parquet storage format, since it is supposedly faster and is supported by Dask. After converting the CSV files to .parquet using pandas' to_parquet() method, I tried the following:

cols_to_keep = ["barcode", "salesdate", "storecode", "quantity", "salesvalue", "promotion", "key_row"]

trans = dd.read_parquet(os.path.join(PARQUET_PATH, '*.parquet'), columns=cols_to_keep)

trans = trans[trans['barcode'].isin(barcodes)]

trans_df = trans.compute()

Soon after the graph starts executing, the workers run out of memory and I get multiple warnings:

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 13620 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 13396 was killed by signal 15

In the end the whole program crashes. My .parquet files are not the problem, I can load these just fine using pandas' read_parquet() method. From the dask utilities I noticed that for some reason the graph tries to read everything in before performing any filtering using the .isin call: dd.read_parquet()执行图

This is not the case when dd.read_csv() is used. There, everything runs in parallel, so the filtering prevents the OOM:

[screenshot: dd.read_csv() execution graph]

Does anyone have any idea what is going on? What am I missing?

Your problem is using pandas' to_parquet() to write the data. This creates a single massive row group, which becomes a single partition when Dask reads the file back: Dask follows whatever partitioning is already in the data. By contrast, Dask chunks CSV input automatically, because CSV has no inherent partitioning.

Since you are already using Dask, you should use it to write the Parquet data too, via dask.dataframe.to_parquet(), the analogue of the pandas method. It will produce multiple files in a directory, which can be read back independently and in parallel.
