It's the first time I've approached data analysis, and I'm trying to solve a classification problem: predicting the price of a car. I have the following DataFrame (already cleaned):
price vehicleType yearOfRegistration gearbox powerPS model kilometer fuelType brand notRepairedDamage
2 9000 suv 2004 automatik 163 grand 125000 diesel jeep not-declared
3 1500 kleinwagen 2001 manuell 75 golf 150000 benzin volkswagen nein
4 3000 kleinwagen 2008 manuell 69 fabia 90000 diesel skoda nein
6 1500 cabrio 2004 manuell 109 2_reihe 150000 benzin peugeot nein
8 12500 bus 2014 manuell 125 c_max 30000 benzin ford not-declared
... ... ... ... ... ... ... ... ... ... ...
371520 3000 limousine 2004 manuell 225 leon 150000 benzin seat ja
371524 1000 cabrio 2000 automatik 101 fortwo 125000 benzin smart nein
371525 9000 bus 1996 manuell 102 transporter 150000 diesel volkswagen nein
371526 3000 kombi 2002 manuell 100 golf 150000 diesel volkswagen not-declared
371527 25000 limousine 2013 manuell 320 m_reihe 50000 benzin bmw nein
So, as you can see, there are categorical attributes, which means I have to encode them. I did it this way:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

encoding = DataFrameMapper([
    (['vehicleType', 'gearbox', 'model', 'fuelType', 'brand', 'notRepairedDamage'],
     OneHotEncoder(handle_unknown='ignore')),
    (['yearOfRegistration', 'powerPS', 'kilometer'], OneHotEncoder(handle_unknown='ignore'))
])
encoding_target = DataFrameMapper([
    (['price'], None)
])
Here I should mention that I had a column called 'names' with the name and optional extras of the car. I had to drop it, since the DataFrame has 250k rows and if I try to encode that column too I get a MemoryError.
Then I proceeded with fitting and transforming:
encoding.fit(data)
encoding_target.fit(data)
X = encoding.transform(data.loc[:, data.columns != "price"])
y = encoding_target.transform(data[['price']])
Then I created the train/test split:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
and then just called the decision tree constructor as:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
accuracy_score(y_test, y_pred)
I get a score of 0.38, which is really low. So I'd like to ask whether the problem is in how I encode the DataFrame for use with sklearn. If yes, is there a better way? This way I also have problems with cross-validation, and I don't feel the DataFrame as it is is fully usable with other algorithms. Thanks :)
I'm not sure this belongs on Stack Overflow rather than Cross Validated or Data Science Stack Exchange, since it's not strictly programming related. But here are some tips:
Your low accuracy is mostly due to bullet 2, by the way.
Edit: Just saw your comment that your y is divided into 12 intervals. So you should do a mix of all the tips. Create a new y variable that will be something like "if price is between 0 and 1k then 0, elif between 1k and 3k then 1" (and so on), maybe using pandas' .loc.
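A minimal sketch of that binning with pd.cut (the bin edges and example prices below are made up for illustration; you'd use your own 12 intervals on data['price']):

```python
import pandas as pd

# Hypothetical price values; in the question these would come from data['price'].
prices = pd.Series([900, 1500, 3000, 9000, 12500, 25000])

# Example bin edges -- replace with your 12 intervals.
bins = [0, 1000, 3000, 10000, float("inf")]
labels = [0, 1, 2, 3]

# Each price falls into one interval and gets that interval's class label.
price_class = pd.cut(prices, bins=bins, labels=labels)
print(price_class.tolist())  # [0, 1, 1, 2, 3, 3]
```

pd.cut does in one call what a chain of .loc assignments would do by hand, and it guarantees the intervals don't overlap.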
If you are predicting the price of a car, price is your target feature, so this would normally be a regression problem, not classification. A classification model has no idea that 1000 is closer to 1500 than to 25000 – it just treats them as separate classes. Your model has an accuracy of 0.38 at predicting the class of price.
Try DecisionTreeRegressor() instead. You can look at some of these metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
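A minimal sketch of the regression version, with toy stand-ins for the encoded feature matrix and the raw (un-binned) prices; mean_absolute_error is just one of the metrics from the page above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Toy data standing in for the encoded features X and raw prices y.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 1000 + 20000 * X[:, 0]  # price driven by the first feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

reg = DecisionTreeRegressor(random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Mean absolute error is in the same units as the target (price); lower is better.
print(mean_absolute_error(y_test, y_pred))
```

Unlike accuracy, this penalizes a prediction of 1500 for a true price of 1000 much less than a prediction of 25000, which matches what you actually care about.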