Fill missing data with random values from a categorical column - Python

Question

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called 'agent' that has 13.7% missing values. My first instinct was to drop the rows with missing values, but since the number of missing values is not that small, I now want to use Random Sampling Imputation to replace them in proportion to the existing categorical values.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks to ti7, who helped me solve the problem:

new_agent = hotel['agent'].dropna()  # get a Series of just the available values
n_null = hotel['agent'].isnull().sum()  # number of missing entries
new_agent.sample(n_null, replace=True).values  # sample with repetition and get the values (preview only, result is not stored)
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
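
Since the goal was to fill the gaps in proportion to the existing values, one way to sanity-check the result is to compare the relative frequencies before and after imputation. The snippet below is only a sketch: the small `hotel` frame is a made-up stand-in for the real dataset, and `random_state=0` is added purely so the run is repeatable. On a tiny sample the proportions can shift noticeably; on a large column they should stay close.

```python
import pandas as pd

# Toy stand-in for the real hotel data (assumption, not the actual dataset).
hotel = pd.DataFrame({"agent": [9, 9, 9, 240, 240, None, None, 14, None, 9]})

# Relative frequency of each agent among the non-missing rows.
before = hotel["agent"].value_counts(normalize=True)

new_agent = hotel["agent"].dropna()
n_null = int(hotel["agent"].isnull().sum())

# Same fill as above, with a fixed seed for repeatability.
hotel.loc[hotel["agent"].isnull(), "agent"] = new_agent.sample(
    n_null, replace=True, random_state=0
).values

# Relative frequency after the fill, side by side with the original.
after = hotel["agent"].value_counts(normalize=True)
comparison = pd.DataFrame({"before": before, "after": after}).fillna(0.0)
print(comparison)
```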

Answer 1

.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!

You probably want to generate a new Series of random values drawn from your current Series (you know how many you need by subtracting the lengths, i.e. the number of missing entries) and use that for the missing values.
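
A small sketch of that behaviour, matching what the question reports: the callable is stored as-is and never called. The Series `s` and the function `pick_value` are made up for illustration, and the exact dtype handling may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

def pick_value(x):
    # Never executed: fillna() does not call its argument.
    return 0.0

# The function object itself is written into the missing slot.
filled = s.fillna(pick_value)

print(filled)
print(callable(filled.iloc[1]))  # the "filled" entry is the function, not a number
```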

1. get a Series of just the available values ( .dropna() )
2. .sample() it with repetition ( replace=True ) to a new Series of the same length as the missing entries ( df["agent"].isna().sum() )
3. get the .values (this is a flat numpy array)
4. filter the column and assign

Quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index

Demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7

>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])

>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
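
If the same fill is needed for several columns, or needs to be repeatable, the steps can be wrapped in a small helper. This is only a sketch: the name `impute_by_random_sampling` and the error raised for an all-missing column are illustrative choices, not part of pandas. It relies on the `random_state` parameter of `Series.sample` to make the draw reproducible.

```python
import pandas as pd

def impute_by_random_sampling(series, random_state=None):
    """Return a copy of `series` with NaNs replaced by values
    sampled (with replacement) from its observed entries."""
    missing = series.isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return series.copy()

    observed = series.dropna()
    if observed.empty:
        raise ValueError("cannot sample: the column has no observed values")

    filled = series.copy()
    # to_numpy() drops the sampled index so values are assigned by position.
    filled.loc[missing] = observed.sample(
        n_missing, replace=True, random_state=random_state
    ).to_numpy()
    return filled

df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
df["agent"] = impute_by_random_sampling(df["agent"], random_state=42)
print(df)
```

Passing the same `random_state` gives the same fill on every run; leaving it as `None` draws fresh values each time, as in the demo above.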

Answer 1: accepted, 2021-03-03 23:22:26