Fill NaN values of DataFrame with random values from the column, depending on frequency

I am trying to fill the NaN values of a pandas DataFrame with random values taken from each column, where each value is drawn with a probability that depends on its frequency in that column. I have this:

def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights=list(x.value_counts()))[0]))
    return dataframe

I get the DataFrame filled in with random data, but it's the same value for all the missing entries of a column. I would like the value to be drawn separately for every missing entry of the column, but I am not able to do it. Could anybody help me?

Thank you very much.

Please see my solution below. First I created a function that fills a Series based on your criteria (frequencies as weights in the random function), and then we apply this function to all columns of the DataFrame:

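import random
from collections import Counter

def fillcolumn(ser):
    cna = len(ser[ser.isna()])   # number of missing values in the column
    l = ser[ser.notna()]
    d = Counter(l)               # value -> frequency among the non-missing entries
    # draw one value per missing entry, weighted by frequency
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])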

Your full code:

def MissingRandom(dataframe):
    import random
    from collections import Counter
    def fillcolumn(ser):
        cna=len(ser[ser.isna()])
        l=ser[ser.notna()]
        d=Counter(l)    
        m=random.choices(list(d.keys()), weights = list(d.values()), k=cna)
        ser[ser.isna()]=m
        return ser
        
    for i in dataframe.columns:
        dataframe[i]=fillcolumn(dataframe[i])
    return dataframe
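
As an alternative to `random.choices` with explicit weights, here is a minimal sketch that relies on the fact that sampling uniformly, with replacement, from the observed (non-missing) values already picks each distinct value in proportion to its frequency. The function name `fill_na_by_frequency` and the `seed` parameter are hypothetical names used for illustration, and returning a copy instead of modifying the input is my own assumption, not part of the answer above.

```python
import numpy as np
import pandas as pd


def fill_na_by_frequency(df, seed=None):
    """Return a copy of df with NaNs replaced by values sampled from the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        observed = out.loc[~mask, col]
        # Skip columns with nothing to fill or nothing to sample from.
        if mask.any() and not observed.empty:
            # Uniform sampling with replacement from the observed values picks
            # each distinct value with probability proportional to its count.
            out.loc[mask, col] = rng.choice(observed.to_numpy(), size=int(mask.sum()))
    return out


df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})
filled = fill_na_by_frequency(df, seed=0)
```

Each missing entry gets its own draw, and the original `df` keeps its NaN values because the work is done on a copy.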

Here are two thoughts on this (interesting!) subject.

  • Create a replace function and call apply
  • Use fillna(method='ffill')

Replace function:

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

Example function:

import random

def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()  # value -> frequency, NaN excluded
    # one weighted draw per missing entry
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x

Apply:

df.apply(lambda x: replace_na(x))

Output (one possible result, since the fill values are drawn at random):

     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0
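
Because the fill values are random, the table above will differ from run to run. If a repeatable result is wanted, one option is to seed Python's `random` module before calling `apply`. The sketch below is an illustration under assumptions: it uses a copy-based variant of `replace_na` (the `.copy()` and the `missing.any()` guard are my additions), and `make_df` is a hypothetical helper that rebuilds the setup DataFrame.

```python
import random

import numpy as np
import pandas as pd


def replace_na(x):
    """Copy-based variant: fill NaNs with frequency-weighted random draws."""
    x = x.copy()
    missing = x.isnull()
    if missing.any():
        vc = x.value_counts()  # value -> frequency, NaN excluded
        x[missing] = random.choices(list(vc.index), weights=list(vc.values),
                                    k=int(missing.sum()))
    return x


def make_df():
    # Hypothetical helper: rebuilds the setup DataFrame from above.
    return pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                         'b': [np.nan, 12, np.nan, np.nan, 15],
                         'c': [21, np.nan, np.nan, 24, 25],
                         'd': [31, np.nan, np.nan, 34, 34]})


random.seed(42)
first = make_df().apply(replace_na)

random.seed(42)
second = make_df().apply(replace_na)

# With the same seed and the same input, both runs draw the same fill values.
same = first.equals(second)
```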

A different thought:

A different thought process, as problem solving is about looking at things from different angles.

I acknowledge that this approach does not meet the OP's specific request, but perhaps it meets the underlying intent.

If the goal is to fill NaN values with a random value from the column, it might be simpler (and equally effective) to forward-fill the empty values. This would also address the frequency, since a missing value is more likely to follow a common value than a less common one.

df.ffill()  # same result as df.fillna(method='ffill'), which newer pandas versions deprecate
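
One caveat with forward-filling: it only copies earlier values downward, so a NaN in the first row of a column has nothing to copy from and stays missing. The sketch below shows this on the setup DataFrame. The `bfill()` fallback is my own assumption about how one might handle that case, not part of the suggestion above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

# Forward-fill: each NaN takes the last non-missing value above it.
forward = df.ffill()

# Column 'b' starts with NaN, so that one entry is still missing.
remaining = int(forward['b'].isna().sum())

# Assumed fallback: back-fill whatever forward-fill could not reach.
complete = forward.bfill()
```

Here `remaining` is 1, and `complete` has no missing values left.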
