
LabelEncoding() vs OneHotEncoding() (sklearn, pandas) suggestions

I have 3 types of categorical data in my dataframe, df.

import pandas as pd

df = pd.DataFrame()
df['Vehicles Owned'] = ['1', '2', '3+', '2', '1', '2', '3+', '2']  # '3+' forces strings
df['Sex'] = ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']
df['Income'] = [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535]

What should I do for df['Vehicles Owned']? (One-hot encode, label-encode, or leave it as is by converting 3+ to an integer? I have used the integer values as they are. Looking for suggestions, since there is an order.)

For df['Sex'], should I label-encode it or one-hot encode it? (As there is no order, I have used one-hot encoding.)

df['Income'] has lots of variation. So should I convert it to bins and use one-hot encoding for low, medium, and high incomes?
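For reference, the binning described here could be sketched with pandas' pd.cut; three equal-width bins and the low/medium/high labels are just illustrative choices:

```python
import pandas as pd

income = pd.Series([42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535])

# Split the range into 3 equal-width bins and label them.
bins = pd.cut(income, bins=3, labels=['low', 'medium', 'high'])
print(bins.value_counts())
```

pd.qcut (equal-frequency bins) is the usual alternative when the distribution is skewed.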

I would recommend:

  • For sex, one-hot encode, which translates to using a single boolean var for is_female or is_male; for n categories you need n-1 one-hot-encoded vars, because the nth is linearly dependent on the first n-1.

  • For vehicles_owned, if you want to preserve order, I would re-map your values from [1, 2, 3, 3+] to [1, 2, 3, 4] and treat it as an int var, or to [1, 2, 3, 3.5] as a float var.

  • For income: you should probably just leave that as a float var. Certain models (like GBT models) will likely do some sort of binning under the hood. If your income data happens to have an exponential distribution, you might try log-transforming it. But just converting it to bins in your own feature engineering is not what I'd recommend.
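A minimal sketch of all three recommendations in pandas (the 3.5 mapping for '3+' and the log transform are illustrative choices, not the only valid ones):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Vehicles Owned': ['1', '2', '3+', '2', '1', '2', '3+', '2'],
    'Sex': ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm'],
    'Income': [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
})

# Sex: one-hot encode; drop_first=True keeps n-1 columns (here, just one).
df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

# Vehicles Owned: re-map to preserve order; 3.5 is one arbitrary stand-in for '3+'.
df['Vehicles Owned'] = df['Vehicles Owned'].map({'1': 1.0, '2': 2.0, '3+': 3.5})

# Income: leave numeric; optionally log-transform if the distribution is skewed.
df['Income'] = np.log(df['Income'])
```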

Meta-advice for all these things: set up a cross-validation scheme you're confident in, try different formulations for all your feature-engineering decisions, and then follow your cross-validated performance measure to make your ultimate decision.
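That comparison loop can be sketched with scikit-learn's cross_val_score; the model choice and the synthetic income data below are purely hypothetical, just to show the pattern of scoring two feature formulations against each other:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical skewed "income" feature and a made-up target, for illustration only.
rng = np.random.default_rng(0)
X_raw = rng.exponential(30000, size=(200, 1))
y = (X_raw[:, 0] > 25000).astype(int)

model = RandomForestClassifier(random_state=0)

# Compare the raw vs. log-transformed formulation under the same CV scheme.
for name, X in [('raw', X_raw), ('logged', np.log(X_raw))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```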

Finally, as for which library/function to use, I prefer pandas' get_dummies because it allows you to keep column names informative in your final feature matrix, like so: https://stackoverflow.com/a/43971156/1870832
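Concretely, get_dummies names each new column prefix_value, so the feature matrix stays self-documenting:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['m', 'f', 'm']})

# Each dummy column is named after the original column and the category value.
dummies = pd.get_dummies(df, columns=['Sex'])
print(dummies.columns.tolist())
```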
