I have a huge dataset with 100M records and 60K columns loaded into a Dask dataframe. I need to perform min() and max() over each entire column. Using Pandas is ruled out due to memory issues.
# Sample Dask dataframe
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [2., 3., 4., 5., 6.],
                   'col3': [4, 6, 8, 3, 2],
                   .
                   .
                   .
                   'col60000': [3, 4, 5, 6, 7]
                   })
ddf = dd.from_pandas(df, npartitions=30)
I could not use the map_partitions function, because it applies the operation to each partition separately rather than to the entire column:
min_deviation = lambda x: (x - x.min())

for col in ddf.columns:
    print("processing column:", col)
    res = ddf[col].map_partitions(min_deviation).compute()
    print(res)
Results:
processing column: col1
0 0
1 1
2 2
3 0
4 1
Name: col1, dtype: int64
processing column: col2
0 0.0
1 1.0
2 2.0
3 0.0
4 1.0
Name: col2, dtype: float64
processing column: col3
0 0
1 2
2 4
3 1
4 0
Name: col3, dtype: int64
Also, the Dask apply() function does not support column-wise operations. Is there any other way to perform an operation on the entire column with a Dask dataframe?
A Dask dataframe has max and min methods that work column-wise by default and produce results from the whole data, across all partitions. You can also use these results in further arithmetic, with or without computing them to concrete values:

- df.min().compute() - the concrete minima of each column
- (df - df.min()) - a lazy version of what you described
- (df - df.min().compute()) - compute the minima up front (may be useful, depending on what you plan to do next)