Fill NaNs with per-column max in dask dataframe
I need to impute the maximum value of each column wherever the value in a dataframe is np.nan. Unfortunately, according to the documentation, SimpleImputer does not support this strategy:
https://ml.dask.org/modules/generated/dask_ml.impute.SimpleImputer.html
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
So I'm trying to do this manually with fillna. This is my attempt:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'height': [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    'weight': [np.nan, 150, 126, 133, 164, 203]
})
df_dask = dd.from_pandas(df, npartitions=2)
meta = [('height', 'float'), ('weight', 'float')]
df_dask = df_dask.apply(lambda x: x.fillna(x.max()), axis=1, meta=meta)
df_dask.compute()
height weight
0 6.21 6.21
1 5.12 150.00
2 5.85 126.00
3 5.78 133.00
4 5.98 164.00
5 203.00 203.00
I'm using axis=1 to work by column; however, dask is taking the max of the row. How can I fix this?
The axis argument works the same way in dask.dataframe as it does in pandas. In pandas, too, it is axis=0 that applies a function column-wise:
In [9]: df.apply(lambda x: x.fillna(x.max()), axis=0)
Out[9]:
height weight
0 6.21 203.0
1 5.12 150.0
2 5.85 126.0
3 5.78 133.0
4 5.98 164.0
5 6.21 203.0
However, in dask.dataframe, you cannot currently apply a function column-wise. See the dask.dataframe.apply docs:
Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only axis=1 is supported (and must be specified explicitly).
- The user should provide output metadata via the meta keyword.
However, you can easily do this without apply, by passing the per-column maxima directly to fillna:
In [19]: df_dask.fillna(df_dask.max(), axis=0).compute()
Out[19]:
height weight
0 6.21 203.0
1 5.12 150.0
2 5.85 126.0
3 5.78 133.0
4 5.98 164.0
5 6.21 203.0