Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called 'agent' that has 13.7% missing values. My first instinct was to drop the rows with missing values, but since the number of missing values is not that small, I now want to use random sampling imputation to replace them in proportion to the existing categorical values.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

Results:

The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks to ti7, who helped me solve the problem:

new_agent = hotel['agent'].dropna()  # Series of just the available values
n_null = hotel['agent'].isnull().sum()  # number of missing entries
new_agent.sample(n_null, replace=True).values  # sample with repetition and get the values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill the missing entries

.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
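A minimal sketch of that behaviour, using a made-up three-element Series rather than the hotel data. The lambda here is only a stand-in; the point is that pandas stores the function object as the fill value and never calls it.

```python
import pandas as pd

# Toy data (hypothetical), with one missing entry at position 2.
s = pd.Series([1.0, 2.0, None])

# The function object itself becomes the fill value; it is never invoked.
filled = s.fillna(lambda x: 0)

print(filled.dtype)              # dtype changes so the column can hold a function
print(callable(filled.iloc[2]))  # the formerly missing cell now holds the lambda
```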

You probably want to generate a new Series of random values drawn from your current Series (you know the shape from subtracting the lengths) and use that for the missing values.

  • get a Series of just the available values ( .dropna() )
  • .sample() it with repetition ( replace=True ) to a new Series of the same length as the missing entries ( df["agent"].isna().sum() )
  • get the .values (this is a flat numpy array)
  • filter the column and assign

quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index
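If you need the fill to be repeatable, the same idea can be wrapped in a small helper. This is only a sketch: fill_with_random_sample is a hypothetical name, and passing random_state through to .sample() is my addition, not part of the one-liner above.

```python
import pandas as pd


def fill_with_random_sample(df, column, random_state=None):
    """Return a copy of df where NaNs in `column` are replaced by random
    draws (with replacement) from that column's observed values."""
    out = df.copy()
    missing = out[column].isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return out  # nothing to fill

    observed = out[column].dropna()
    if observed.empty:
        raise ValueError(f"column {column!r} has no observed values to sample from")

    # .to_numpy() drops the sampled index so values are assigned by position.
    out.loc[missing, column] = observed.sample(
        n_missing, replace=True, random_state=random_state
    ).to_numpy()
    return out


df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
filled = fill_with_random_sample(df, "agent", random_state=0)
print(filled)
```

Returning a copy leaves the original frame untouched, which makes it easy to compare the column before and after filling.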

demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
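Since the goal in the question is to fill "proportionally", one rough sanity check is to compare each category's share before and after filling. The sketch below uses synthetic data (agent IDs 1, 2, 3 and the 60/30/10 split are invented for illustration) and assumes the values are missing at random.

```python
import numpy as np
import pandas as pd

# Synthetic column (made-up data): agent IDs in roughly 60/30/10 proportions.
rng = np.random.default_rng(0)
agent = pd.Series(rng.choice([1.0, 2.0, 3.0], size=10_000, p=[0.6, 0.3, 0.1]))

# Blank out about 13.7% of the entries at random.
agent[rng.random(agent.size) < 0.137] = np.nan

# Shares among the observed values only (NaN is excluded by default).
before = agent.value_counts(normalize=True)

# Fill the gaps by sampling the observed values with replacement.
missing = agent.isna()
agent.loc[missing] = agent.dropna().sample(
    int(missing.sum()), replace=True, random_state=0
).to_numpy()

after = agent.value_counts(normalize=True)

# Largest change in any category's share; it should be small.
print((after - before).abs().max())
```

On a small frame like the five-row demo the shares can shift noticeably from one run to the next; the match only becomes close with many rows.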
