
dask.read_parquet causes OOM Error

I have been using Dask to perform data cleansing on multiple CSV files. This code works fine:

import pandas as pd
import glob
import os
from timeit import default_timer
from dask.distributed import Client
import dask.dataframe as dd

cols_to_keep = ["barcode", "salesdate", "storecode", "quantity", "salesvalue", "promotion", "key_row"]

col_types = {'barcode': object,
             'salesdate': object,
             'storecode': object,
             'quantity': float,
             'salesvalue': float,
             'promotion': object,
             'key_row': object}

trans = dd.read_csv(os.path.join(TRANS_PATH, "*.TXT"), 
                    sep=";", usecols=cols_to_keep, dtype=col_types, parse_dates=['salesdate'])

trans = trans[trans['barcode'].isin(barcodes)]

trans_df = trans.compute()

I decided to try out the Parquet storage format, since it is supposedly faster and is supported by Dask. After converting the CSV files to .parquet using pandas' to_parquet() method, I tried the following:

cols_to_keep = ["barcode", "salesdate", "storecode", "quantity", "salesvalue", "promotion", "key_row"]

trans = dd.read_parquet(os.path.join(PARQUET_PATH, '*.parquet'), columns=cols_to_keep)

trans = trans[trans['barcode'].isin(barcodes)]

trans_df = trans.compute()

Soon after the graph starts executing, the workers run out of memory and I get multiple warnings:

distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 13620 was killed by signal 15
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget.  Restarting
distributed.nanny - WARNING - Worker process 13396 was killed by signal 15

In the end the whole program crashes. My .parquet files are not the problem, I can load these just fine using pandas' read_parquet() method. From the dask utilities I noticed that for some reason the graph tries to read everything in before performing any filtering using the .isin call: dd.read_parquet()执行图

This is not the case when dd.read_csv() is used. There, everything runs in parallel, so the filtering prevents the OOM:

[screenshot: dd.read_csv() execution graph]

Does anyone have any idea what is going on? What am I missing?

Your problem is using pandas' to_parquet() to write the data. This creates a single massive row group, which becomes a single partition when Dask reads the file back: Dask follows whatever partitioning is already in the data. By contrast, Dask chunks CSV input automatically, because CSV has no inherent partitioning.

Since you are already using Dask, you should use it to write the Parquet data too, via dask.dataframe.to_parquet(), the analogue of the pandas method. It will produce multiple files in a directory, which can be read back independently and in parallel.
