For every column and cell in a dataframe, fill in NaNs/nulls with a random value from that column

I am trying to fill in the NaN/null values in every column of a dataframe by random sampling from that column (i.e. sampling a non-NaN value for each missing cell). Right now I am doing the following:

    for col in df:
        count = 0
        while True:
            sample = df[col].sample(n=1)
            count += 1
            if pd.notna(sample.item()):
                df[col].replace(sample, np.nan, inplace=True)
                break
            if count >= 100:
                break

This is incorrect because:

  1. It relies on a hack: it samples up to 100 times in the hope of finding a non-NaN value within 100 tries.

  2. It would fill every cell with the same sample, whereas I would like to sample a value randomly for each cell separately, so as not to introduce any skew.

  3. It does not work in any case, for some reason: the resulting df has NaNs as before.

Note: the dataframe contains both numbers and strings.

You could use np.random.choice to generate a sample from a population of values:

sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)

For example,

import numpy as np
import pandas as pd

arr = np.random.randint(10, size=(10,3)).astype(float)
mask = np.random.randint(2, size=arr.shape, dtype=bool)
arr[mask] = np.nan
df = pd.DataFrame(arr)
print(df)
#      0    1    2
# 0  8.0  NaN  0.0
# 1  1.0  3.0  2.0
# 2  NaN  NaN  NaN
# 3  6.0  NaN  7.0
# 4  NaN  8.0  5.0
# 5  1.0  4.0  6.0
# 6  NaN  NaN  NaN
# 7  NaN  NaN  NaN
# 8  8.0  NaN  NaN
# 9  5.0  NaN  2.0

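for col in df:
    mask = pd.isnull(df[col])   # rows where this column is missing
    pop = df[col].dropna()      # population of non-NaN values to draw from
    if len(pop):                # skip columns that are entirely NaN
        # one independent draw per missing cell; len(df)-len(pop) == mask.sum()
        sample = np.random.choice(pop, size=len(df)-len(pop), replace=True)
        df.loc[mask, col] = sample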


print(df)

yields a result such as

     0    1    2
0  8.0  4.0  0.0
1  1.0  3.0  2.0
2  1.0  8.0  2.0
3  6.0  3.0  7.0
4  8.0  8.0  5.0
5  1.0  4.0  6.0
6  1.0  8.0  2.0
7  8.0  4.0  6.0
8  8.0  4.0  7.0
9  5.0  3.0  2.0
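
The same loop can be wrapped into a reusable function. What follows is a minimal sketch, not part of the original answer: the function name fill_na_by_sampling and its seed parameter are illustrative assumptions. It uses NumPy's default_rng generator instead of the global np.random state so that a run can be reproduced. It draws random positions into the non-NaN values rather than passing the values to a sampler, so a column of strings is handled the same way as a column of numbers.

```python
import numpy as np
import pandas as pd


def fill_na_by_sampling(df, seed=None):
    """Return a copy of df in which every NaN is replaced by a value drawn
    at random (with replacement) from the non-NaN values of its column.

    Hypothetical helper for illustration; assumes unique column labels.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        pop = out[col].dropna().to_numpy()
        n_missing = int(mask.sum())
        # Nothing to fill, or nothing to draw from (all-NaN column): skip.
        if n_missing == 0 or len(pop) == 0:
            continue
        # One independent draw per missing cell.
        idx = rng.integers(0, len(pop), size=n_missing)
        out.loc[mask, col] = pop[idx]
    return out


example = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, np.nan],
    "txt": ["a", None, "b", None],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})
filled = fill_na_by_sampling(example, seed=0)
print(filled)
```

Returning a copy leaves the input frame untouched. A column with no non-NaN values is left as it is, because there is nothing to sample from.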

df[col] returns a Series. Modifying this Series is not guaranteed to modify df itself. Thus

df[col].replace(sample, np.nan, inplace=True)

modifies the Series returned by df[col] but fails to modify df.

Generally, to ensure that you modify a DataFrame, use df.loc[...] = ... or df.iloc[...] = ..., or generate a new DataFrame and reassign it to df (e.g. df = new_df), or generate a new column of values and reassign it to a column (e.g. df[col] = values).
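
As a small illustration of two of these assignment patterns, here is a sketch on a made-up frame. The names demo, a and b are hypothetical, and constant fill values are used only to keep the result easy to check. Both forms write through to the DataFrame itself:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, np.nan]})

# Pattern 1: assign through df.loc with a boolean row mask and a column label.
mask_a = demo["a"].isna()
demo.loc[mask_a, "a"] = 0.0

# Pattern 2: build a new column of values and reassign it to the column.
demo["b"] = demo["b"].fillna(-1.0)

print(demo)
```

In both cases the assignment targets demo directly, not an intermediate Series obtained from demo[col].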
