I am trying to fill the NaN values in a pandas DataFrame with random data drawn from each column, so that each value appears in proportion to its frequency in that column. I have this:
def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights = list(x.value_counts()))[0]))
    return dataframe
The DataFrame does get filled with random data, but every missing value in a column receives the same value. I would like each missing value in a column to get its own random draw, but I have not been able to do it. Could anybody help me?
Thank you very much
Please see my solution below. First, I created a function that fills a Series based on your criteria (frequencies as weights in the random function); then we apply this function to every column of the DataFrame:
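import random
from collections import Counter

def fillcolumn(ser):
    cna = len(ser[ser.isna()])   # number of missing values
    l = ser[ser.notna()]         # observed (non-missing) values
    d = Counter(l)               # value -> frequency
    # draw one value per missing entry, weighted by frequency
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])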
Your full code:
def MissingRandom(dataframe):
    import random
    from collections import Counter
    def fillcolumn(ser):
        cna = len(ser[ser.isna()])
        l = ser[ser.notna()]
        d = Counter(l)
        m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
        ser[ser.isna()] = m
        return ser
    for i in dataframe.columns:
        dataframe[i] = fillcolumn(dataframe[i])
    return dataframe
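As a possible variation on the code above, here is a minimal sketch that leaves the input DataFrame untouched and returns a filled copy. It relies on the fact that sampling with replacement from a column's observed values already weights each value by its frequency, so no explicit counting is needed. The function name `fill_missing_random` and the `seed` parameter are my own assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_missing_random(dataframe, seed=None):
    """Return a copy with each NaN replaced by a random draw from its column."""
    rng = np.random.default_rng(seed)  # hypothetical seed parameter, for reproducibility
    result = dataframe.copy()
    for col in result.columns:
        missing = result[col].isna()
        observed = result.loc[~missing, col]
        if missing.any() and not observed.empty:
            # Sampling observed values with replacement weights each
            # value by how often it occurs in the column.
            draws = rng.choice(observed.to_numpy(), size=int(missing.sum()))
            result.loc[missing, col] = draws
    return result
```

Calling `fill_missing_random(df, seed=0)` should give the same fill on every run, while `seed=None` gives a fresh fill each time. Columns that contain only NaN are left as they are, since there is nothing to sample from.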
Here are two thoughts on this (interesting) subject:
1. apply
2. fillna(method='ffill')
Setup:
import random
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})
Example function:
def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x
Apply:
df.apply(lambda x: replace_na(x))
Output:
     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0
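Because the fill is random, the output above will differ between runs. As a rough sanity check, the sketch below fixes the seed and compares the value shares before and after filling on a larger, made-up Series. The test data and the copy-returning variant of `replace_na` are assumptions for illustration only.

```python
import random
import numpy as np
import pandas as pd

def replace_na(x):
    """Return a copy of x with NaN replaced by frequency-weighted random draws."""
    vc = x.value_counts()  # NaN is excluded by default
    n_missing = int(x.isnull().sum())
    draws = random.choices(vc.index.tolist(), weights=vc.tolist(), k=n_missing)
    filled = x.copy()
    filled[filled.isnull()] = draws
    return filled

random.seed(0)  # fix the seed so repeated runs give the same fill

# Made-up data: 80% 'x' and 20% 'y' among the observed values, plus 1000 NaN.
s = pd.Series(['x'] * 800 + ['y'] * 200 + [np.nan] * 1000)
filled = replace_na(s)

print(s.value_counts(normalize=True))       # observed shares: x 0.8, y 0.2
print(filled.value_counts(normalize=True))  # shares after filling, expected to stay close
```

With this many draws the shares after filling should stay near 0.8 and 0.2, though the exact figures depend on the seed.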
A different thought process... as problem solving is about looking at different angles.
I acknowledge that this approach does not meet the OP's specific request, but it may meet the underlying intent.
If the goal is to fill NaN values with a random value from the column, it might be simpler (and equally effective) to forward-fill the empty values. This also roughly respects the frequencies, as a more common value is more likely than a less common one to be followed by a missing value.
df.fillna(method='ffill')
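One caveat worth illustrating: forward-fill has nothing to copy into a gap at the very start of a column, such as the first row of 'b' in the setup above. The sketch below shows this, plus one possible workaround. Chaining a back-fill is my own assumption about what would be acceptable here, not something the answer proposes. It uses `df.ffill()`, the method form of the same operation, since newer pandas releases deprecate the `method=` argument of `fillna`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})

# Forward-fill copies the previous non-missing value downwards.
forward = df.ffill()
print(forward['b'].isna().sum())  # 1: the first row of 'b' has no earlier value

# Assumption: a back-fill is acceptable for any leading gaps that remain.
both = df.ffill().bfill()
print(both.isna().sum().sum())    # 0
```

Note that, unlike the random fill, this result is deterministic and depends on row order.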