I have a huge dataset with 100M records and 60K columns loaded into a Dask dataframe. I need to perform min() and max() over each entire column. Using Pandas is ruled out due to memory issues.
# Sample Dask dataframe
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [2., 3., 4., 5., 6.],
                   'col3': [4, 6, 8, 3, 2],
                   .
                   .
                   .
                   'col60000': [3, 4, 5, 6, 7]
                   })
ddf = dd.from_pandas(df, npartitions=30)
I could not use the map_partitions function, because it applies the operation to each partition separately rather than to the entire column:
min_deviation = lambda x: (x - x.min())

for col in ddf.columns:
    print("processing column:", col)
    res = ddf[col].map_partitions(min_deviation).compute()
    print(res)
Results:
processing column: col1
0 0
1 1
2 2
3 0
4 1
Name: col1, dtype: int64
processing column: col2
0 0.0
1 1.0
2 2.0
3 0.0
4 1.0
Name: col2, dtype: float64
processing column: col3
0 0
1 2
2 4
3 1
4 0
Name: col3, dtype: int64
Also, the Dask apply() function does not support column-wise operations. Is there any other way to perform an operation on the entire column with a Dask dataframe?
A Dask dataframe has max and min methods that work column-wise by default and produce results from the whole data, across all partitions. You can also use these results in further arithmetic, with or without computing them to concrete values:

- df.min().compute() - the concrete minima of each column
- (df - df.min()) - a lazy version of what you described
- (df - df.min().compute()) - compute the minima up front (may be useful, depending on what you plan to do next)