
Fill NaNs with per-column max in dask dataframe

I need to replace every np.nan in a dataframe with the maximum value of its column. Unfortunately, according to the documentation, SimpleImputer does not support this strategy:

https://ml.dask.org/modules/generated/dask_ml.impute.SimpleImputer.html

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

So I'm trying to do this manually with fillna. This is my attempt:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({
    'height': [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    'weight': [np.nan, 150, 126, 133, 164, 203]
})

df_dask = dd.from_pandas(df, npartitions=2)
meta = [('height', 'float'), ('weight', 'float')]
df_dask = df_dask.apply(lambda x: x.fillna(x.max()), axis=1, meta=meta)

df_dask.compute()

    height  weight
0   6.21    6.21
1   5.12    150.00
2   5.85    126.00
3   5.78    133.00
4   5.98    164.00
5   203.00  203.00

I'm using axis=1 to work by column, but dask is taking the max of each row instead. How can I fix this?

The axis argument works the same way in dask.dataframe as it does in pandas: axis=1 passes each row to the function, while axis=0 passes each column. In pandas, axis=0 gives the result you want:

In [9]: df.apply(lambda x: x.fillna(x.max()), axis=0)
Out[9]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0
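To make the difference between the two axes concrete, here is a small pandas-only sketch using the data from the question. The variable names row_wise and col_wise are only illustrative. With axis=1 the lambda sees one row at a time, which is why the attempt in the question filled each NaN with the largest value in its row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [6.21, 5.12, 5.85, 5.78, 5.98, np.nan],
    "weight": [np.nan, 150, 126, 133, 164, 203],
})

# axis=1: the lambda receives each row, so x.max() is the row maximum
row_wise = df.apply(lambda x: x.fillna(x.max()), axis=1)

# axis=0: the lambda receives each column, so x.max() is the column maximum
col_wise = df.apply(lambda x: x.fillna(x.max()), axis=0)

print(row_wise)
print(col_wise)
```

The first frame reproduces the unwanted output from the question; the second matches the output shown above.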

However, in dask.dataframe you cannot currently apply a function column-wise. See the dask.dataframe.DataFrame.apply docs:

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  • Only axis=1 is supported (and must be specified explicitly).
  • The user should provide output metadata via the meta keyword.

However, you could easily do this without an apply:

In [19]: df_dask.fillna(df_dask.max(), axis=0).compute()
Out[19]:
   height  weight
0    6.21   203.0
1    5.12   150.0
2    5.85   126.0
3    5.78   133.0
4    5.98   164.0
5    6.21   203.0
