
Dask Running Out of Memory (16GB) When Using apply

I am trying to perform some string manipulation on data combined from 6 CSVs, about 3.5GB+ in total size.

**Total CSV size:** 3.5GB+
**Total RAM size:** 16GB
**Library used:** Dask
**Shape of combined DataFrame:** 6 million rows and 57 columns
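
The CSVs are presumably loaded along these lines (a minimal sketch; the glob pattern and read arguments are assumptions, since the loading code isn't shown):

import dask.dataframe as dd

# Hypothetical glob pattern; the original setup does not show the loading code.
# blocksize controls how many bytes of each CSV go into one partition,
# so the combined 3.5GB is processed piece by piece rather than all at once.
df = dd.read_csv("data/*.csv", blocksize="64MB")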

I have a method that just eliminates unwanted characters from the essential columns:

import re
import pandas as pd

def stripper(x):
    try:
        # Only clean actual strings; skip floats (NaN) and pandas' NA marker.
        if type(x) != float and type(x) != pd._libs.missing.NAType:
            x = re.sub(r"[^\w]+", "", x).upper()  # drop non-word chars, uppercase
    except Exception:
        pass
    return x

And I am applying the above method to certain columns as:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].apply(stripper, axis=1, meta=df)

I am also filling the null values of one column with the values from another column:

df["MatchSourceOwnerId"] = df["SourceOwnerId"].fillna(df["SourceKey"])

These are the two operations I need to perform, and afterwards I just call .head() to pull some values (as Dask works by lazy evaluation):

temp_df = df.head(10000)

But when I do this, it keeps eating RAM until my entire 16GB is gone and the kernel dies.

How can I solve this issue? Any help would be appreciated.

I'm not familiar with Dask, but it seems to me like you can use .str.replace for each column instead of a custom function for each row, and go for a more vectorized solution:

df[["MatchCounty", "MatchZip", "SourceOwnerId", "SourceKey"]] = df[["County", "Zip", "SourceOwnerId", "SourceKey"]].dropna().apply(lambda col: col.astype(str).str.replace(r"[^\w]+", ""), meta=df)

To expand on @richardec's solution: in Dask you can directly use DataFrame.replace and Series.str.upper, which should be faster than using an apply. For example:

import dask.dataframe as dd
import pandas as pd

# Small example frame split across two partitions.
ddf = dd.from_pandas(
    pd.DataFrame(
        {'a': [1, 'kdj821', '* dk0 '],
         'b': ['!23d', 'kdj821', '* dk0 '],
         'c': ['!23d', 'kdj821', None]}),
    npartitions=2)

# Vectorized regex cleanup instead of a row-wise apply.
ddf[['a', 'b']] = ddf[['a', 'b']].replace(r"[^\w]+", r"", regex=True)
ddf['c'] = ddf['c'].fillna(ddf['a']).str.upper()
ddf.compute()
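
On this toy frame, compute() should return a as [1, 'kdj821', 'dk0'] (replace applies the regex only to strings, leaving the integer untouched) and c as ['!23D', 'KDJ821', 'DK0'], since the missing value is filled from the already-cleaned a before uppercasing.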

It would also be good to know how many partitions you've split the Dask DataFrame into: each partition should fit comfortably in memory (i.e. < 1GB), but you also don't want to have too many (see DataFrame Best Practices in the Dask docs).
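
As a rough sketch (the paths are hypothetical and the 100MB target is an assumed ballpark, not a number from the docs), you can inspect and adjust the partitioning like so:

import dask.dataframe as dd

# Hypothetical glob; substitute the real CSV paths.
df = dd.read_csv("data/*.csv")

print(df.npartitions)  # how many pieces the DataFrame is split into

# Rebalance so each partition holds roughly 100MB of data.
df = df.repartition(partition_size="100MB")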
