scikit-learn, categorical (but numerical) features in Linear Regression

I'm using Linear Regression in scikit-learn, and my dataset contains some features that are categorical but expressed as numbers. For example, the value of the district where a house is located is expressed as an integer between 1 and 7: the higher the number, the more valuable the house. Should I preprocess a feature that expresses a category (the district of the city) with numbers, using an encoder such as OneHotEncoder, before running Linear Regression? Or is that compulsory only when the category is expressed by characters? Thank you in advance.

If I understand correctly, you don't need to one-hot encode these features, since they are ordinal, i.e. the order carries meaning. If the numbers were product codes, for example, and there was no sense in which 7 is "better than" or "more than" 4, then you would want to one-hot encode those variables. In this case, however, you would be losing information by one-hot encoding.
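
To make the difference concrete, here is a minimal sketch on made-up data (the variable names, the price formula, and the noise level are all assumptions for illustration, not taken from the question). It fits one model with the district score kept as a single numeric column, and another with the same column passed through OneHotEncoder:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical data: a district score from 1 to 7 (each value repeated 30 times)
# and a price that rises with the score, plus some noise.
district = np.tile(np.arange(1, 8), 30).reshape(-1, 1)
price = 50_000 + 20_000 * district.ravel() + rng.normal(0, 5_000, size=district.shape[0])

# Option 1: treat the score as ordinal and keep it as one numeric column.
# The model learns a single slope: the price change per step in district score.
ordinal_model = LinearRegression().fit(district, price)

# Option 2: one-hot encode the score, as you would for unordered codes.
# The model learns a separate coefficient per district and ignores their order.
encoder = OneHotEncoder()
district_onehot = encoder.fit_transform(district).toarray()
onehot_model = LinearRegression().fit(district_onehot, price)

n_ordinal_coefs = ordinal_model.coef_.size
n_onehot_coefs = onehot_model.coef_.size

print("coefficients with the numeric column:", n_ordinal_coefs)
print("coefficients with one-hot encoding:", n_onehot_coefs)
print("estimated price change per district step:", ordinal_model.coef_[0])
```

The numeric version uses one coefficient and assumes each step in the score changes the price by about the same amount. The one-hot version spends one coefficient per district and no longer knows that 7 comes after 4, which is the information loss described above.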
