Fill missing data with random values from categorical column - Python
I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called 'agent' that has 13.7% missing values. My first instinct was to drop the rows with missing values, but since there are quite a few of them, I now want to use random sampling imputation instead, replacing them with existing values of the column in proportion to how often each occurs.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE: Thanks to ti7, who helped me solve the problem:
new_agent = hotel['agent'].dropna()  # get a Series of just the available values
n_null = hotel['agent'].isnull().sum()  # number of missing entries
new_agent.sample(n_null, replace=True).values  # sample it with repetition and get the values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
.fillna() is naively assigning your function itself to the missing values rather than calling it. It can do this because functions are really objects!
You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.
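As a rough illustration of the length subtraction mentioned above, the sketch below uses a hypothetical toy column (the names agent, available and n_missing are assumptions, not taken from the hotel data) and shows that the same count can also be read off the null mask:

```python
import pandas as pd

# Hypothetical toy column standing in for hotel['agent'].
agent = pd.Series([1, 2, None, None, 10], name="agent")

available = agent.dropna()               # only the observed values
n_missing = len(agent) - len(available)  # "subtracting the lengths"

# The same count can be read directly off the null mask.
same_count = int(agent.isna().sum())
```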
Step by step:
- get a Series of just the available values ( .dropna() )
- .sample() it with repetition ( replace=True ) to a new Series of the same length as the missing entries ( df["agent"].isna().sum() )
- get the .values (this is a flat numpy array)

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
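If the fill needs to be repeatable, the same steps could be wrapped in a small helper. This is only a sketch: fill_with_random_sample is a hypothetical name, and passing random_state through to .sample() is an addition for reproducibility that is not part of the answer above.

```python
import pandas as pd


def fill_with_random_sample(series, random_state=None):
    """Return a copy of `series` whose NaNs are replaced by values drawn
    at random (with replacement) from its non-missing entries."""
    mask = series.isna()
    n_missing = int(mask.sum())
    observed = series.dropna()
    if n_missing == 0 or observed.empty:
        return series.copy()  # nothing to fill, or nothing to draw from
    draws = observed.sample(n=n_missing, replace=True, random_state=random_state)
    filled = series.copy()
    filled.loc[mask] = draws.to_numpy()  # drop the sampled index before assigning
    return filled


# Hypothetical toy data, same shape as the example above.
df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
df["agent"] = fill_with_random_sample(df["agent"], random_state=0)
```

Because values are drawn in proportion to how often they already occur, the filled column should keep roughly the same distribution; comparing value_counts(normalize=True) before and after the fill is one way to check that.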