How to impute NaN values based on the values of another column?
I have 2 columns in a dataframe:
1) work experience (years)
2) company_type
I want to impute the company_type column based on the work experience column. The company_type column has NaN values, which I want to fill based on the work experience column. The work experience column does not have any missing values.
Here work_exp is numerical data and company_type is categorical data.
Example data:
Work_exp company_type
10 PvtLtd
0.5 startup
6 Public Sector
8 NaN
1 startup
9 PvtLtd
4 NaN
3 Public Sector
2 startup
0 NaN
I have decided on the thresholds for imputing the NaN values:
Startup if work_exp < 2yrs
Public sector if work_exp > 2yrs and <8yrs
PvtLtd if work_exp >8yrs
Based on the above threshold criteria, how can I impute the missing categorical values in the company_type column?
You can use numpy.select with numpy.where:
import numpy as np

# define conditions and values
conditions = [df['Work_exp'] < 2, df['Work_exp'].between(2, 8), df['Work_exp'] > 8]
values = ['Startup', 'PublicSector', 'PvtLtd']

# apply logic where company_type is null
df['company_type'] = np.where(df['company_type'].isnull(),
                              np.select(conditions, values),
                              df['company_type'])

print(df)
Work_exp company_type
0 10.0 PvtLtd
1 0.5 startup
2 6.0 PublicSector
3 8.0 PublicSector
4 1.0 startup
5 9.0 PvtLtd
6 4.0 PublicSector
7 3.0 PublicSector
8 2.0 startup
9 0.0 Startup
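For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and fills only the missing entries. Two things in it are my own assumptions rather than part of the answer above: the explicit default="Unknown" label (hypothetical; as far as I know some recent NumPy releases refuse to mix string choices with the integer default of 0, so a string default sidesteps that), and the use of fillna with an index-aligned Series in place of np.where.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

conditions = [
    df["Work_exp"] < 2,
    df["Work_exp"].between(2, 8),  # both boundaries included by default
    df["Work_exp"] > 8,
]
values = ["Startup", "PublicSector", "PvtLtd"]

# "Unknown" is a hypothetical fallback label; the three conditions
# already cover every non-missing Work_exp value, so it is never used here.
imputed = pd.Series(np.select(conditions, values, default="Unknown"),
                    index=df.index)

# fillna only touches rows where company_type is missing.
df["company_type"] = df["company_type"].fillna(imputed)
print(df)
```

Existing labels such as "startup" are left untouched, so the column can end up with mixed spellings ("startup" vs "Startup") unless the labels are normalised first.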
pd.Series.between includes the start and end values by default, and permits comparison between float values. Use inclusive='neither' to omit the boundaries (pandas versions before 1.3 used inclusive=False instead).
s = pd.Series([2, 2.5, 4, 4.5, 5])
s.between(2, 4.5)
0 True
1 True
2 True
3 True
4 False
dtype: bool
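Since the thresholds in the question are half-open ranges, the one-sided options may be more useful than dropping both boundaries. The sketch below assumes pandas 1.3 or later, where inclusive accepts the strings "both", "neither", "left" and "right"; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([2, 2.5, 4, 4.5, 5])

both = s.between(2, 4.5)                          # 2 <= x <= 4.5 (default)
left = s.between(2, 4.5, inclusive="left")        # 2 <= x <  4.5
neither = s.between(2, 4.5, inclusive="neither")  # 2 <  x <  4.5

print(pd.DataFrame({"s": s, "both": both, "left": left, "neither": neither}))
```

With inclusive="left", a condition such as df['Work_exp'].between(2, 8, inclusive="left") would put exactly 8 years outside the Public Sector range, which is one way to read the thresholds.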
Great answer by @jpp. Just want to add a different approach here using pandas.cut():
df['company_type'] = pd.cut(
    df.Work_exp,
    bins=[0, 2, 8, 100],
    right=False,
    labels=['Startup', 'Public', 'Private']
)
Work_exp company_type
0 10.0 Private
1 0.5 Startup
2 6.0 Public
3 8.0 Private
4 1.0 Startup
5 9.0 Private
6 4.0 Public
7 3.0 Public
8 2.0 Public
9 0.0 Startup
Also, based on your conditions, shouldn't index 8 be Public?
Startup < 2
PublicSector >=2 and < 8
PvtLtd >= 8
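Note that the pd.cut assignment above overwrites the whole column, including rows that already had a company type. If only the NaN rows should be filled, something like the following sketch might work. It assumes the half-open thresholds just listed, uses np.inf as the upper edge instead of 100, and borrows the label spellings from the other answer; all of those are my choices, not requirements.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the values to impute.
df = pd.DataFrame({
    "Work_exp": [10, 0.5, 6, 8, 1, 9, 4, 3, 2, 0],
    "company_type": ["PvtLtd", "startup", "Public Sector", np.nan, "startup",
                     "PvtLtd", np.nan, "Public Sector", "startup", np.nan],
})

# right=False gives the intervals [0, 2), [2, 8), [8, inf).
binned = pd.cut(
    df["Work_exp"],
    bins=[0, 2, 8, np.inf],
    right=False,
    labels=["Startup", "PublicSector", "PvtLtd"],
)

# Convert the categorical result to plain objects, then fill only the gaps.
df["company_type"] = df["company_type"].fillna(binned.astype(object))
print(df)
```

Here index 3 (8 years) is filled with PvtLtd, whereas the between(2, 8) version in the other answer labels it PublicSector, so the choice of boundary handling does change the result.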