Fill DataFrame NaN values with random values from the column, according to frequency

Question

I am trying to fill the NaN values of a pandas DataFrame with random values drawn from each column, so that each value is chosen with a probability that depends on its frequency in that column. I have this:

def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights=list(x.value_counts()))[0]))
    return dataframe

The DataFrame gets filled with random data, but the same value is used for every missing entry in a column. I would like a different random draw for each missing value in the column, but I have not been able to do it. Could anybody help me?

Thank you very much.

Answer 1

Please see my solution below. First, I created a function that fills a Series according to your criteria (frequencies used as weights in random.choices). Then we apply this function to every column of the DataFrame:

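import random
from collections import Counter

def fillcolumn(ser):
    cna = len(ser[ser.isna()])   # number of missing values
    l = ser[ser.notna()]         # observed (non-missing) values
    d = Counter(l)               # value -> frequency
    # draw one value per missing entry, weighted by frequency
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])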

Your full code:

def MissingRandom(dataframe):
    import random
    from collections import Counter
    def fillcolumn(ser):
        cna=len(ser[ser.isna()])
        l=ser[ser.notna()]
        d=Counter(l)    
        m=random.choices(list(d.keys()), weights = list(d.values()), k=cna)
        ser[ser.isna()]=m
        return ser
        
    for i in dataframe.columns:
        dataframe[i]=fillcolumn(dataframe[i])
    return dataframe
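
For comparison, here is a minimal sketch of the same idea using NumPy's random generator instead of random.choices. The function name fill_na_by_frequency and its rng parameter are made up for this illustration and are not part of the answer above. Unlike fillcolumn, it works on a copy, so the original column is left untouched, and it simply returns the column unchanged when there is nothing to fill or nothing to sample from.

```python
import numpy as np
import pandas as pd

def fill_na_by_frequency(ser, rng=None):
    """Return a copy of `ser` with NaNs replaced by values sampled from the
    non-missing entries, with probability proportional to their frequency."""
    rng = np.random.default_rng() if rng is None else rng
    mask = ser.isna()
    # Relative frequency of each observed value (NaN is excluded by default).
    freqs = ser.value_counts(normalize=True)
    if not mask.any() or freqs.empty:
        return ser.copy()  # nothing to fill, or nothing to sample from
    filled = ser.copy()
    # One independent draw per missing entry.
    filled[mask] = rng.choice(freqs.index.to_numpy(),
                              size=int(mask.sum()),
                              p=freqs.to_numpy())
    return filled

df = pd.DataFrame({"a": [1, np.nan, 3, 4, np.nan],
                   "b": [np.nan, 12, np.nan, np.nan, 15]})

# Extra keyword arguments given to apply are forwarded to the function.
filled_df = df.apply(fill_na_by_frequency, rng=np.random.default_rng(0))
```

Passing a seeded generator, as in the last line, makes the fill repeatable; leaving rng out gives a fresh random fill on each call.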

Answer 2

Here are two thoughts on this (interesting!) subject.

Create a replace function and call apply
Use fillna(method='ffill')

Replace function:

Setup:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

Example function:

def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x

Apply:

df.apply(lambda x: replace_na(x))

Output (the filled values vary between runs because they are drawn at random):

     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0
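
Because the fill is random, the table above will differ from run to run. If a repeatable result is needed, one option is to seed Python's random module before applying the function. The sketch below assumes the same setup and re-declares replace_na so that it is self-contained; as a deliberate difference from the version above, it copies the column first, so the input DataFrame is not modified.

```python
import random

import numpy as np
import pandas as pd

def replace_na(x):
    """Fill NaNs with frequency-weighted random draws, working on a copy."""
    x = x.copy()
    vc = x.value_counts()             # observed values and their counts
    n_missing = int(x.isna().sum())   # how many draws are needed
    x[x.isna()] = random.choices(list(vc.index),
                                 weights=list(vc.values),
                                 k=n_missing)
    return x

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

random.seed(42)               # the seed value itself is arbitrary
first = df.apply(replace_na)

random.seed(42)               # same seed, same sequence of draws
second = df.apply(replace_na)
```

With the same seed, first and second contain identical values, while df itself still holds its NaNs.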

A different thought:

A different thought process, since problem solving is about looking at things from different angles.

I acknowledge that this approach does not meet the OP's specific request, but it perhaps meets the underlying intent.

If the goal is to fill NaN values with a random value from the column, it might be simpler (and about as effective) to forward-fill the missing values. This also roughly respects the frequencies, since a missing value is more likely to be preceded by a common value than by a rare one.

df.fillna(method='ffill')
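
Two caveats are worth illustrating. Forward-filling is deterministic rather than random, and it cannot fill a NaN that sits at the top of a column, because there is no earlier value to copy. Also, as far as I know, recent pandas releases (2.1 and later) deprecate the method= argument of fillna in favour of DataFrame.ffill(). The sketch below uses a two-column subset of the setup data; chaining bfill() to cover the leading gap is my own suggestion, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})

# Forward-fill: each NaN takes the last valid value above it.
forward = df.ffill()

# Column 'b' starts with NaN, so row 0 is still missing after ffill.
# Back-filling afterwards copies the next valid value upwards.
both = df.ffill().bfill()

print(forward)
print(both)
```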

2 answers

Answer 1
2 2020-12-01 19:00:52

Answer 2
0 2020-12-01 19:21:29
