How to impute NaN values based on the values of another column?

I have two columns in a dataframe:

1) work experience (years)

2) company_type

I want to impute the company_type column based on the work experience column. The company_type column has NaN values, which I want to fill based on work experience. The work experience column does not have any missing values.

Here work_exp is numerical data and company_type is categorical data.

Example data:

Work_exp      company_type
   10            PvtLtd
   0.5           startup
   6           Public Sector
   8               NaN
   1             startup
   9              PvtLtd
   4               NaN
   3           Public Sector
   2             startup
   0               NaN 

I have decided on the thresholds for imputing the NaN values:

Startup if work_exp < 2yrs
Public sector if work_exp > 2yrs and <8yrs
PvtLtd if work_exp >8yrs

Based on the above threshold criteria, how can I impute the missing categorical values in the company_type column?

You can use numpy.select with numpy.where:

import numpy as np

# define conditions and values
conditions = [df['Work_exp'] < 2, df['Work_exp'].between(2, 8), df['Work_exp'] > 8]
values = ['Startup', 'PublicSector', 'PvtLtd']

# apply logic where company_type is null
# default=None avoids mixing the string choices with np.select's numeric default of 0
df['company_type'] = np.where(df['company_type'].isnull(),
                              np.select(conditions, values, default=None),
                              df['company_type'])

print(df)

   Work_exp  company_type
0      10.0        PvtLtd
1       0.5       startup
2       6.0  PublicSector
3       8.0  PublicSector
4       1.0       startup
5       9.0        PvtLtd
6       4.0  PublicSector
7       3.0  PublicSector
8       2.0       startup
9       0.0       Startup
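
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data exactly as typed in the question (so the existing labels read "Public Sector" rather than "PublicSector"), and the helper name impute_company_type is hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})


def impute_company_type(frame):
    """Hypothetical helper: fill missing company_type from Work_exp thresholds."""
    exp = frame["Work_exp"]
    conditions = [exp < 2, exp.between(2, 8), exp > 8]
    values = ["Startup", "PublicSector", "PvtLtd"]
    # default=None keeps the result as an object array and leaves a row
    # empty if none of the conditions match.
    guessed = np.select(conditions, values, default=None)
    out = frame.copy()
    # Only rows where company_type is missing take the guessed label.
    out["company_type"] = np.where(out["company_type"].isnull(),
                                   guessed, out["company_type"])
    return out


filled = impute_company_type(df)
print(filled)
```

Only rows 3, 6 and 9 change; rows that already had a label keep it, and the input frame is left untouched because the helper works on a copy.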

pd.Series.between includes both the start and end values by default and works with float values. To omit the boundaries, pass inclusive='neither' (pandas 1.3 and later; older versions used inclusive=False).

s = pd.Series([2, 2.5, 4, 4.5, 5])

s.between(2, 4.5)

0     True
1     True
2     True
3     True
4    False
dtype: bool
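
To see how the boundary handling changes, here is a small sketch comparing three settings of the inclusive argument on the same series. It assumes pandas 1.3 or later, where inclusive accepts the strings "both", "neither", "left" and "right".

```python
import pandas as pd

s = pd.Series([2, 2.5, 4, 4.5, 5])

# String values for `inclusive` are accepted by pandas 1.3 and later.
both = s.between(2, 4.5, inclusive="both")        # 2 <= x <= 4.5 (the default)
neither = s.between(2, 4.5, inclusive="neither")  # 2 <  x <  4.5
left = s.between(2, 4.5, inclusive="left")        # 2 <= x <  4.5

print(pd.DataFrame({"value": s, "both": both, "neither": neither, "left": left}))
```

With inclusive="left", a condition such as between(2, 8) would send a value of exactly 8 to the next bucket instead of keeping it in the middle one.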

Great answer by @jpp. I just want to add a different approach here using pandas.cut():

df['company_type'] = pd.cut(
    df.Work_exp,
    bins=[0,2,8,100],
    right=False,
    labels=['Startup', 'Public', 'Private']
)



   Work_exp company_type
0      10.0      Private
1       0.5      Startup
2       6.0       Public
3       8.0      Private
4       1.0      Startup
5       9.0      Private
6       4.0       Public
7       3.0       Public
8       2.0       Public
9       0.0      Startup

Also, based on your conditions, shouldn't index 8 be Public?

  • Startup < 2
  • PublicSector >=2 and < 8
  • PvtLtd >= 8
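
Note that the pandas.cut() call above relabels every row, including the ones that already had a company_type. If the goal is to fill only the missing values, one possible variation is to compute the bins separately and pass them to fillna. This is a sketch under the boundaries listed above; the sample data is copied from the question, and the use of np.inf as the open upper edge is my own choice rather than part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

# Left-closed bins: [0, 2) -> Startup, [2, 8) -> PublicSector, [8, inf) -> PvtLtd.
binned = pd.cut(
    df["Work_exp"],
    bins=[0, 2, 8, np.inf],
    right=False,
    labels=["Startup", "PublicSector", "PvtLtd"],
)

# Convert the categorical labels to plain objects, then use them only
# where company_type is missing; fillna aligns on the index.
df["company_type"] = df["company_type"].fillna(binned.astype(object))

print(df)
```

With these left-closed bins, the missing value at Work_exp 8 becomes PvtLtd, whereas the inclusive between(2, 8) in the numpy.select answer labels it PublicSector, so it is worth deciding which side each boundary belongs to before imputing.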
