Sklearn classification with DecisionTree, how to improve fit?

This is my first time doing data analysis, and I'm trying to solve a classification problem. I'm trying to predict the price of a car. I have the following DataFrame (already cleaned):

        price  vehicleType  yearOfRegistration  gearbox    powerPS  model        kilometer  fuelType  brand       notRepairedDamage
2       9000   suv          2004                automatik  163      grand        125000     diesel    jeep        not-declared
3       1500   kleinwagen   2001                manuell    75       golf         150000     benzin    volkswagen  nein
4       3000   kleinwagen   2008                manuell    69       fabia        90000      diesel    skoda       nein
6       1500   cabrio       2004                manuell    109      2_reihe      150000     benzin    peugeot     nein
8       12500  bus          2014                manuell    125      c_max        30000      benzin    ford        not-declared
...     ...    ...          ...                 ...        ...      ...          ...        ...       ...         ...
371520  3000   limousine    2004                manuell    225      leon         150000     benzin    seat        ja
371524  1000   cabrio       2000                automatik  101      fortwo       125000     benzin    smart       nein
371525  9000   bus          1996                manuell    102      transporter  150000     diesel    volkswagen  nein
371526  3000   kombi        2002                manuell    100      golf         150000     diesel    volkswagen  not-declared
371527  25000  limousine    2013                manuell    320      m_reihe      50000      benzin    bmw         nein

So, as you can see, there are categorical attributes, and I have to encode them. I did it this way:

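from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

# One-hot encode the categorical columns and, separately, the numeric ones
encoding = DataFrameMapper([
    (['vehicleType', 'gearbox', 'model', 'fuelType', 'brand', 'notRepairedDamage'],
      OneHotEncoder(handle_unknown='ignore')),
    (["yearOfRegistration", "powerPS", "kilometer"], OneHotEncoder(handle_unknown='ignore'))
    ])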

encoding_target = DataFrameMapper([
    (['price'], None)
])

Here I should mention that I had a column called 'names' with the name and optional extras of the car. I had to drop it, since the DataFrame has 250k rows and if I try to encode that column too I get a memory error.

Then I proceeded with fitting and transforming:

encoding.fit(data)
encoding_target.fit(data)

X = encoding.transform(data.loc[:, data.columns != "price"])
y = encoding_target.transform(data[['price']])

Then I created the train/test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)

and then just called the decision tree constructor:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
accuracy_score(y_test, y_pred)

I get a score of 0.38, which is really low. So I'd like to ask whether the problem is in how I encode the DataFrame for use with sklearn. If so, is there a better way? With this approach I also have problems with cross-validation, and I don't feel the DataFrame as it stands is fully usable with other algorithms. Thanks :)

I'm not sure this is a Stack Overflow question rather than one for a Stack Exchange data science site, since it's not programming related. But here are some tips:

  1. You don't need to encode your y variable; most sklearn classifiers will convert it to numbers themselves. If not, just build a mapping for it (if the values are categorical).
  2. It's going to be nearly impossible for an algorithm to predict an exact large number such as a price. Try transforming your y variable into something else (e.g. higher or lower than 10k, or % increase after X periods).
  3. When your categorical y variable has more than 2 options, pass the stratify=y argument to train_test_split.

By the way, your low accuracy is mostly due to point 2.
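As a rough illustration of tip 3, here is a minimal sketch of a stratified split. The labels and features are random placeholders I made up (they are not the asker's data); only train_test_split and its stratify argument come from the tips above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced class labels standing in for price categories.
rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])
X = rng.normal(size=(1000, 3))  # placeholder features, not the car data

# stratify=y keeps the class proportions (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Compare the share of each class in the full data and in the test split.
full_share = np.bincount(y) / len(y)
test_share = np.bincount(y_test) / len(y_test)
print("full:", full_share)
print("test:", test_share)
```

Without stratify, a rare class can end up under-represented in the test split purely by chance.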

Edit: I just saw your comment that your y is divided into 12 intervals. So you should combine all of these tips. Create a new y variable along the lines of: if price is between 0 and 1k then 0, elif between 1k and 3k then 1 (and so on), maybe using pandas' .loc.
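One way to sketch that binning step is with pd.cut, which I'm using here in place of a chain of .loc assignments. The bin edges below are made up for illustration (the asker's actual 12 intervals aren't shown), and the prices are taken from the sample rows in the question.

```python
import pandas as pd

# A few prices from the sample rows in the question.
data = pd.DataFrame({"price": [9000, 1500, 3000, 12500, 1000, 25000]})

# Hypothetical bin edges; the last bin is open-ended.
edges = [0, 1000, 3000, 10000, float("inf")]

# labels=False returns the integer index of the bin each price falls in.
# Bins are closed on the right, e.g. (1000, 3000]; include_lowest=True
# also puts a price of exactly 0 into the first bin.
data["price_class"] = pd.cut(
    data["price"], bins=edges, labels=False, include_lowest=True
)
print(data)
```

The new price_class column can then be used as y for the classifier instead of the raw price.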

If you are predicting the price of a car, price is your target variable, so this would normally be a regression problem, not classification. The classification model has no idea that 1000 is closer to 1500 than to 25000; it just treats them as separate classes. Your model has an accuracy of 0.38 at predicting the price class.
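To make that concrete, here is a small sketch with made-up predictions. Accuracy scores a near miss and a wild miss the same, while a regression metric such as mean absolute error tells them apart.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

y_true = [1000, 1500, 25000]

# Two hypothetical sets of predictions: one close to the truth, one far off.
near_miss = [1500, 1000, 24000]
far_miss = [25000, 25000, 1000]

# Accuracy only counts exact matches, so both sets score the same.
acc_near = accuracy_score(y_true, near_miss)
acc_far = accuracy_score(y_true, far_miss)

# Mean absolute error measures how far off the predictions are.
mae_near = mean_absolute_error(y_true, near_miss)
mae_far = mean_absolute_error(y_true, far_miss)

print("accuracy:", acc_near, acc_far)
print("MAE:", mae_near, mae_far)
```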

Try DecisionTreeRegressor() instead. You can look at some of these metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
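A minimal sketch of what that could look like, under several assumptions of mine: the data is synthetic (the price formula is invented), I'm using sklearn's ColumnTransformer and Pipeline rather than the DataFrameMapper from the question, numeric columns are passed through instead of being one-hot encoded, and max_depth=8 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the question's DataFrame (all values are made up).
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "brand": rng.choice(["volkswagen", "bmw", "ford"], size=n),
    "gearbox": rng.choice(["manuell", "automatik"], size=n),
    "yearOfRegistration": rng.integers(1995, 2016, size=n),
    "powerPS": rng.integers(50, 300, size=n),
    "kilometer": rng.choice([30000, 90000, 125000, 150000], size=n),
})
# Invented price rule plus noise, only so the example has something to learn.
data["price"] = (
    500 * (data["yearOfRegistration"] - 1995)
    + 30 * data["powerPS"]
    - 0.02 * data["kilometer"]
    + rng.normal(0, 500, size=n)
)

categorical = ["brand", "gearbox"]
numeric = ["yearOfRegistration", "powerPS", "kilometer"]

# One-hot encode categoricals; leave numeric columns as plain numbers.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", "passthrough", numeric),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=8, random_state=0)),
])

X = data.drop(columns="price")
y = data["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("R^2:", r2)
```

Because the encoding lives inside the pipeline, the same model object can also be passed to cross_val_score, which refits the encoder on each training fold.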
