Fill NaN values of DataFrame with random values from the column, depending on frequency
I am trying to fill the NaN values of a pandas DataFrame with random data taken from each column, so that each value is chosen with a probability that depends on its frequency in that column. I have this:
def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights=list(x.value_counts()))[0]))
    return dataframe
I get the DataFrame filled in with random data, but it is the same value for all the missing entries of a column. I would like the value to be different for every missing entry of the column, but I am not able to do it. Could anybody help me?
Thank you very much.
Please see my solution below. First, I created a function that fills a Series according to your criteria (frequencies used as weights in the random function). Then we apply this function to all columns of the DataFrame:
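import random
from collections import Counter

def fillcolumn(ser):
    cna = len(ser[ser.isna()])   # number of missing values
    l = ser[ser.notna()]         # observed (non-missing) values
    d = Counter(l)               # value -> frequency
    # draw one value per missing entry, weighted by frequency
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])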
Your full code:
def MissingRandom(dataframe):
    import random
    from collections import Counter

    def fillcolumn(ser):
        cna = len(ser[ser.isna()])   # number of missing values
        l = ser[ser.notna()]         # observed (non-missing) values
        d = Counter(l)               # value -> frequency
        m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
        ser[ser.isna()] = m
        return ser

    for i in dataframe.columns:
        dataframe[i] = fillcolumn(dataframe[i])
    return dataframe
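Note that the function above writes into the DataFrame that is passed in. A minimal sketch of a variant that works on a copy and takes a seed for reproducible fills is shown here. The name fill_na_by_frequency and the seed parameter are hypothetical, chosen for illustration only, and columns with nothing to draw from are simply left untouched as an assumption about the desired behaviour.

```python
import random
from collections import Counter

import numpy as np
import pandas as pd


def fill_na_by_frequency(dataframe, seed=None):
    """Return a copy with each NaN replaced by a value drawn from its own column.

    Hypothetical helper: values are drawn with probability proportional to
    how often they occur among the non-missing entries of the column.
    """
    rng = random.Random(seed)           # local generator, global state untouched
    result = dataframe.copy()
    for col in result.columns:
        mask = result[col].isna()
        n_missing = int(mask.sum())
        observed = result.loc[~mask, col]
        if n_missing == 0 or observed.empty:
            continue                    # nothing to fill, or nothing to draw from
        counts = Counter(observed)
        draws = rng.choices(list(counts.keys()),
                            weights=list(counts.values()),
                            k=n_missing)
        result.loc[mask, col] = draws   # one independent draw per missing cell
    return result


df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})
filled = fill_na_by_frequency(df, seed=0)
```

Because the draws come from a local random.Random instance, calling the function twice with the same seed gives the same result, and the original df keeps its NaN values.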
Here are two thoughts on the (interesting) subject:
1. apply
2. fillna(method='ffill')
Setup:
import random
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})
Example function:
def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x
Apply:
df.apply(lambda x: replace_na(x))
Output (the filled values are random, so yours will differ):
     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0
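The same idea can also be written with NumPy's random generator instead of the random module. The following is only a sketch under a few assumptions: the name replace_na_numpy and its rng argument are hypothetical, and the function returns a filled copy rather than modifying the Series in place.

```python
import numpy as np
import pandas as pd


def replace_na_numpy(x, rng):
    """Return a copy of Series x with NaN filled by frequency-weighted draws."""
    probs = x.value_counts(normalize=True)   # relative frequency of each observed value
    mask = x.isna()
    if probs.empty or not mask.any():
        return x.copy()                      # nothing to draw from, or nothing to fill
    draws = rng.choice(probs.index.to_numpy(),
                       size=int(mask.sum()),
                       p=probs.to_numpy())
    out = x.copy()
    out[mask] = draws                        # one independent draw per missing cell
    return out


rng = np.random.default_rng(42)   # seeded so repeated runs give the same fill
df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'd': [31, np.nan, np.nan, 34, 34]})
filled = df.apply(replace_na_numpy, rng=rng)
```

In column d the value 34 occurs twice and 31 once among the observed entries, so each missing cell gets 34 with probability 2/3 and 31 with probability 1/3.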
A different thought process, since problem solving is about looking at things from different angles.
I acknowledge that this approach does not meet the OP's specific request, but it perhaps meets the underlying intent.
If the goal is to fill NaN values with a random value from the column, it might be simpler (and just as effective) to forward-fill the empty values. This also addresses the frequency to some extent, since a more common value is more likely than a less common one to sit directly before a missing value.
df.ffill()  # same result as df.fillna(method='ffill'), which is deprecated in recent pandas versions
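One thing to keep in mind with forward-filling: a NaN in the first row has no earlier value to copy, so it stays missing (column b in the setup above starts with NaN). A small sketch of this follows. The backward-fill fallback is an addition for illustration and not part of the suggestion above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})

forward = df.ffill()                  # copy the last seen value downwards
print(forward['b'].isna().sum())      # → 1, the leading NaN in 'b' is still there

both = df.ffill().bfill()             # fallback: fill the rest from the next valid value
print(both['b'].tolist())             # → [12.0, 12.0, 12.0, 12.0, 15.0]
```

Unlike the random fills, this result is fully determined by the row order, so shuffling the rows first would change which values get copied.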