Dask DataFrame: applying a custom function to an entire column, involving min() and max()

Question

A huge dataset with 100M records and 60K columns is loaded into a Dask dataframe. I need to compute min() and max() over each entire column. Using Pandas is ruled out due to memory issues.

# Sample Dask DataFrame
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [2., 3., 4., 5., 6.],
                   'col3': [4, 6, 8, 3, 2],
                   # ... (remaining columns elided) ...
                   'col60000': [3, 4, 5, 6, 7]
                  })
ddf = dd.from_pandas(df, npartitions=30)

I could not use the map_partitions function because it applies the function to each partition separately, not to the entire column:

min_deviation = lambda x: (x - x.min())

for col in ddf.columns:
    print("processing column:", col)
    res = ddf[col].map_partitions(min_deviation).compute()
    print(res)

Results:
processing column: col1
0    0
1    1
2    2
3    0
4    1
Name: col1, dtype: int64
processing column: col2
0    0.0
1    1.0
2    2.0
3    0.0
4    1.0
Name: col2, dtype: float64
processing column: col3
0    0
1    2
2    4
3    1
4    0
Name: col3, dtype: int64

Also, the Dask apply() function does not support column-wise operations.

Is there any other way to perform an operation over an entire column with a Dask dataframe?

Answer 1

A Dask dataframe has max and min methods that work column-wise by default and produce results from the whole data, across all partitions. You can also use these results in further arithmetic, with or without computing them to concrete values:

df.min().compute() - the concrete minima of each column
(df - df.min()) - lazy version of what you described
(df - df.min().compute()) - compute the minima up front (may be useful, depending on what you plan to do next)

Solution 1
1 Accepted 2020-06-22 16:48:05
