Fill in missing values differently for different columns in pandas
Say I have a dataframe with columns of different types, numeric and categorical. I want to fill missing values with the median for numeric columns and with a randomly sampled value for categorical columns.
This is what I am doing so far:
def fill_nulls(df, num_cols, cat_cols):
    dic = {}
    for col in num_cols:
        dic[col] = 'median'
    for col in cat_cols:
        dic[col] = lambda x: x.sample(1)
    df = df.apply(dic)  # NOT SURE WHAT MUST BE HERE
I am creating a dictionary that specifies the desired method for each column, but right now I am not sure how to make this work for missing values. I believe it should be something like apply(dic), but I am not sure how to apply it to the missing values only.
Thanks!
EDIT:
What I am doing currently:
for col in cat_cols:
    bools = pd.notnull(df[col])
    notnulls = df[col][bools]
    sample = notnulls.sample(1)
    sample = sample.tolist()[0]
    df[col] = df[col].fillna(value=sample)
for col in num_cols:
    med = df[col].median()
    print(type(med))
    df[col] = df[col].fillna(value=med)
It is probably not the most efficient way of doing it, so if anyone knows a better way, it would be nice to know. Thanks!
I have assumed here that your data consists only of numeric and categorical columns (no datetime columns). To demonstrate, first set up some sample data:
import numpy as np
import pandas as pd
df = pd.DataFrame({0: ["0:00", np.nan, "12:00", np.nan, "06:00"],
                   1: [np.nan, 4, 12, 2, np.nan],
                   2: [100, 2, np.nan, -3.6, np.nan],
                   3: ["a", "b", "a", np.nan, np.nan]})
df
       0     1      2    3
0   0:00   NaN  100.0    a
1    NaN   4.0    2.0    b
2  12:00  12.0    NaN    a
3    NaN   2.0   -3.6  NaN
4  06:00   NaN    NaN  NaN
Now, fill in the missing values as per your requirements:
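# Fill numeric columns with their median (numeric_only=True skips the string columns)
df = df.fillna(df.median(numeric_only=True))
# Fill the remaining (categorical) columns with one randomly chosen observed value per column
df = df.apply(lambda x: x.fillna(np.random.choice(x.dropna())))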
df
       0     1      2  3
0   0:00   4.0  100.0  a
1   0:00   4.0    2.0  b
2  12:00  12.0    2.0  a
3   0:00   2.0   -3.6  a
4  06:00   4.0    2.0  a
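Note that the apply call above draws a single random value per column, so every gap in a column receives the same fill (for example, all of the missing entries in column 0 become "0:00"). If you would rather draw a separate value for each missing cell, something along these lines might work. This is only a sketch under a few assumptions: the name fill_nulls is borrowed from the question, the seed parameter is my own addition for reproducibility, and the column lists are passed in explicitly rather than inferred.

```python
import numpy as np
import pandas as pd


def fill_nulls(df, num_cols, cat_cols, seed=None):
    """Return a copy of df: medians in num_cols, random observed values in cat_cols."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical convenience parameter
    out = df.copy()
    for col in num_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in cat_cols:
        missing = out[col].isna()
        observed = out[col].dropna().to_numpy()
        # Skip columns with nothing to fill or nothing to sample from.
        if missing.any() and len(observed) > 0:
            # One independent draw (with replacement) per missing cell.
            out.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return out


df = pd.DataFrame({0: ["0:00", np.nan, "12:00", np.nan, "06:00"],
                   1: [np.nan, 4, 12, 2, np.nan],
                   2: [100, 2, np.nan, -3.6, np.nan],
                   3: ["a", "b", "a", np.nan, np.nan]})
filled = fill_nulls(df, num_cols=[1, 2], cat_cols=[0, 3], seed=0)
```

Sampling from the observed values with replacement roughly preserves each column's category frequencies, whereas a single fill value per column skews them toward whichever value happened to be drawn.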