
Sklearn classification with DecisionTree, how to improve fit?

This is my first attempt at data analysis, and I'm trying to solve a classification problem: I want to predict the price of a car. I have the following DataFrame (already cleaned):

        price  vehicleType  yearOfRegistration  gearbox    powerPS  model        kilometer  fuelType  brand       notRepairedDamage
2       9000   suv          2004                automatik  163      grand        125000     diesel    jeep        not-declared
3       1500   kleinwagen   2001                manuell    75       golf         150000     benzin    volkswagen  nein
4       3000   kleinwagen   2008                manuell    69       fabia        90000      diesel    skoda       nein
6       1500   cabrio       2004                manuell    109      2_reihe      150000     benzin    peugeot     nein
8       12500  bus          2014                manuell    125      c_max        30000      benzin    ford        not-declared
...     ...    ...          ...                 ...        ...      ...          ...        ...       ...         ...
371520  3000   limousine    2004                manuell    225      leon         150000     benzin    seat        ja
371524  1000   cabrio       2000                automatik  101      fortwo       125000     benzin    smart       nein
371525  9000   bus          1996                manuell    102      transporter  150000     diesel    volkswagen  nein
371526  3000   kombi        2002                manuell    100      golf         150000     diesel    volkswagen  not-declared
371527  25000  limousine    2013                manuell    320      m_reihe      50000      benzin    bmw         nein

As you can see, there are categorical attributes, so I have to encode them. I did it this way:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

encoding = DataFrameMapper([
    (['vehicleType', 'gearbox', 'model', 'fuelType', 'brand', 'notRepairedDamage'], 
      OneHotEncoder(handle_unknown='ignore')),    
    (["yearOfRegistration", "powerPS", "kilometer"], OneHotEncoder(handle_unknown='ignore'))
    ])

encoding_target = DataFrameMapper([
    (['price'], None)
])

I should mention that I also had a column called 'names' with the name and options of the car. I had to drop it, since the dataframe has 250k rows and I get a MemoryError if I try to encode that column too.

Then I proceeded with fitting and transforming:

encoding.fit(data)
encoding_target.fit(data)

X = encoding.transform(data.loc[:, data.columns != "price"])
y = encoding_target.transform(data[['price']])

Then I created the train/test split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

and then just called the decision tree constructor:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
accuracy_score(y_test, y_pred)

I get a score of 0.38, which is really low. So I'd like to ask whether the problem is in how I encode the dataframe for use with sklearn. If yes, is there a better way? With this approach I also have problems with cross-validation, and I don't feel the dataframe as it stands is fully usable with other algorithms. Thanks :)

I'm not sure this is a Stack Overflow question rather than one for a Stack Exchange data science site, since it's not programming related. But here are some tips:

  1. You don't need to encode your y variable; most sklearn estimators will convert it to numbers. If not, just build a mapping by hand (if the values are categorical).
  2. It's going to be nearly impossible for an algorithm to hit an exact price as a prediction. Try transforming your y variable into something else (e.g. higher or lower than 10k, or % increase after X periods).
  3. When your categorical y variable has more than 2 options, pass the "stratify=y" argument to train_test_split.

Your low accuracy is mostly due to bullet 2, by the way.

Edit: I just saw your comment that your y is divided into 12 intervals. So you should combine all of the tips: create a new y variable along the lines of "if price is between 0 and 1k then 0, else if between 1k and 3k then 1" (and so on), maybe using pandas' .loc.
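The binning and the stratified split could be sketched roughly as follows. This is only an illustration: the prices are made-up toy values, the interval edges in `bin_edges` are hypothetical (the actual 12 intervals are not given in the thread), and `pd.cut` is used here as a shorter alternative to a chain of `.loc` assignments.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy prices standing in for the real 'price' column (made-up values).
data = pd.DataFrame({"price": [500, 900, 800, 300,
                               1500, 2500, 3000, 2000,
                               4500, 9000, 7000, 3500,
                               12500, 25000, 11000, 18000]})

# Hypothetical interval edges, chosen only for this example.
bin_edges = [0, 1000, 3000, 10000, float("inf")]

# labels=False returns the integer index of the bin (0, 1, 2, ...).
# Bins are right-closed by default: (0, 1000], (1000, 3000], ...
data["price_class"] = pd.cut(data["price"], bins=bin_edges, labels=False)

# stratify keeps the class proportions the same in both splits.
train, test = train_test_split(
    data, test_size=0.25, random_state=0, stratify=data["price_class"]
)
```

With the default right-closed bins a price of exactly 0 would fall outside the first interval, so `include_lowest=True` may be worth considering if such rows exist.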

If you are predicting the price of a car, price is your target variable, so this would normally be a regression problem, not classification. The classification model has no idea that 1000 is closer to 1500 than to 25000; it just treats them as separate classes. Your model has an accuracy of 0.38 at predicting the exact price class.
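A tiny illustration of that point, using made-up numbers: accuracy scores a near miss and a far miss identically, while a regression metric such as mean absolute error tells them apart.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up true prices and two sets of predictions.
y_true = [1000, 1000]
near_miss = [1500, 1000]   # first car is off by 500
far_miss = [25000, 1000]   # first car is off by 24000

# Accuracy only checks exact equality, so both predictions score the same.
acc_near = accuracy_score(y_true, near_miss)
acc_far = accuracy_score(y_true, far_miss)

# Mean absolute error reflects how far off each prediction is.
mae_near = mean_absolute_error(y_true, near_miss)
mae_far = mean_absolute_error(y_true, far_miss)

print(acc_near, acc_far)   # identical accuracies
print(mae_near, mae_far)   # very different errors
```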

Try DecisionTreeRegressor() instead. You can look at some of these metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
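A minimal sketch of what that could look like, built on the ten sample rows shown in the question. Two things here are assumptions rather than anything stated in the thread: sklearn's `ColumnTransformer` is swapped in for the `DataFrameMapper` from the question, and the numeric columns (yearOfRegistration, powerPS, kilometer) are passed through as plain numbers instead of being one-hot encoded. Ten rows are far too few for a meaningful score, so the resulting number only shows the mechanics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

columns = ["price", "vehicleType", "yearOfRegistration", "gearbox", "powerPS",
           "model", "kilometer", "fuelType", "brand", "notRepairedDamage"]
# The ten sample rows visible in the question.
rows = [
    (9000, "suv", 2004, "automatik", 163, "grand", 125000, "diesel", "jeep", "not-declared"),
    (1500, "kleinwagen", 2001, "manuell", 75, "golf", 150000, "benzin", "volkswagen", "nein"),
    (3000, "kleinwagen", 2008, "manuell", 69, "fabia", 90000, "diesel", "skoda", "nein"),
    (1500, "cabrio", 2004, "manuell", 109, "2_reihe", 150000, "benzin", "peugeot", "nein"),
    (12500, "bus", 2014, "manuell", 125, "c_max", 30000, "benzin", "ford", "not-declared"),
    (3000, "limousine", 2004, "manuell", 225, "leon", 150000, "benzin", "seat", "ja"),
    (1000, "cabrio", 2000, "automatik", 101, "fortwo", 125000, "benzin", "smart", "nein"),
    (9000, "bus", 1996, "manuell", 102, "transporter", 150000, "diesel", "volkswagen", "nein"),
    (3000, "kombi", 2002, "manuell", 100, "golf", 150000, "diesel", "volkswagen", "not-declared"),
    (25000, "limousine", 2013, "manuell", 320, "m_reihe", 50000, "benzin", "bmw", "nein"),
]
data = pd.DataFrame(rows, columns=columns)

X = data.drop(columns="price")
y = data["price"]

categorical = ["vehicleType", "gearbox", "model", "fuelType", "brand", "notRepairedDamage"]
# One-hot encode only the categorical columns; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the encoder on the training split only, then the tree.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error: {mae:.0f}")
```

Wrapping the encoder and the tree in one pipeline also means the whole thing can be handed to `cross_val_score` with a regression scorer such as `"neg_mean_absolute_error"`, which may help with the cross-validation trouble mentioned in the question.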
