
Python Sklearn Predicting values on an unseen data set

I have a set of football data in a database that I am trying to predict values for.

import MySQLdb
import pandas as pd
from sklearn.feature_selection import RFE
from sqlalchemy import create_engine
import mysql.connector
from matplotlib import pyplot

mysql_cn = MySQLdb.connect(host='database.rds.amazonaws.com', port=3306, user='username', passwd='password', db='dev')

# Games with a known full-time result (FTR): used for training
games = pd.read_sql('''SELECT game_id, game_date_id, home_team_id, away_team_id, referee_id, FTR, away_team_travel
                       FROM dev.tmp_all_output_id WHERE game_id < 6700;''', con=mysql_cn)

# Unplayed games: FTR is unknown, so -10 is selected as a placeholder
predict_games = pd.read_sql('''SELECT game_id, game_date_id, home_team_id, away_team_id, referee_id, -10 AS FTR, away_team_travel
                               FROM dev.tmp_all_output_id WHERE game_id > 6700;''', con=mysql_cn)

feature_names = ['game_id', 'game_date_id', 'home_team_id', 'away_team_id', 'referee_id', 'away_team_travel']
X = games[feature_names]
y = games['FTR']

# Create training and test sets
from sklearn.model_selection import train_test_split
validation_size = 0.20
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

predictions = ada.predict(X_test)

print('Accuracy of AdaBoostClassifier on training set: {:.2f}'.format(ada.score(X_train, y_train)))
print('Accuracy of AdaBoostClassifier on test set: {:.2f}'.format(ada.score(X_test, y_test)))

#cnx = create_engine('mysql+mysqlconnector://username:password@database.rds.amazonaws.com:3306/dev', echo=False)
#testResults.to_sql(name='tmp_all_output_prediction', con=cnx, if_exists = 'replace', index=False)

mysql_cn.close()

Once I have loaded my data set into a data frame, run train_test_split, and fitted a classifier on it, how do I predict values for an unseen data set and return the game_id values together with the predicted values (FTR)?

As you can see in the code, I have a table (tmp_all_output_id) where I select known result values into 'games' and select unknown (or unplayed) results into 'predict_games'. I also set FTR (full time result) for 'predict_games' = -10 as at this point the result of these games is not yet known.

But how do I use the training that I have done to predict FTR for the data frame 'predict_games'?

I tried to use this code to predict, however it always came back with 0 (draw) for FTR which is certainly not correct.

testResults = predict_games[['game_id']]
testResults.is_copy = None
testResults['FTR'] = raw_prediction

I have added the following code:

unseen_prediction = predict_games[feature_names]
new_predictions = ada.predict(unseen_prediction)
print(new_predictions)

However, every predicted value is returned as -1 (away win), which is not correct.

Your ada variable is now a trained classifier instance. To use it to classify new data, construct an X with the same columns, in the same order, as the training data: 'game_id', 'game_date_id', 'home_team_id', 'away_team_id', 'referee_id', 'away_team_travel'.

Then you run ada.predict(X) and you're done!
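As a rough sketch of that step, here is the same idea on a small synthetic table instead of the MySQL query. The random data and the make_games helper are made up for illustration; only the column names and the AdaBoostClassifier calls come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

feature_names = ['game_id', 'game_date_id', 'home_team_id',
                 'away_team_id', 'referee_id', 'away_team_travel']

# Synthetic stand-in for the rows read from the database (illustrative only).
rng = np.random.default_rng(0)

def make_games(n, first_id):
    """Hypothetical helper: build n fake games with consecutive game_ids."""
    return pd.DataFrame({
        'game_id': np.arange(first_id, first_id + n),
        'game_date_id': rng.integers(1, 400, n),
        'home_team_id': rng.integers(1, 21, n),
        'away_team_id': rng.integers(1, 21, n),
        'referee_id': rng.integers(1, 31, n),
        'away_team_travel': rng.integers(0, 500, n),
    })

games = make_games(200, first_id=1)
games['FTR'] = rng.choice([-1, 0, 1], size=len(games))  # known results
predict_games = make_games(10, first_id=6701)           # unplayed games

ada = AdaBoostClassifier()
ada.fit(games[feature_names], games['FTR'])

# Same columns, in the same order, as the training data; FTR is not passed in.
X_unseen = predict_games[feature_names]
new_predictions = ada.predict(X_unseen)
print(new_predictions)
```

Because the labels here are random, the predicted classes carry no meaning; the point is only the shape of the call: one prediction per row of predict_games.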

The issue is that you're currently only passing in the game_id.
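To get the game_id values back alongside the predictions, one option is to copy the id column and attach the output of ada.predict as a new FTR column. The sketch below uses toy data and only two feature columns to stay short; the values are invented, and the .copy() call is a suggestion rather than something from the original code.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

feature_names = ['home_team_id', 'away_team_id']  # shortened for the sketch

# Toy stand-ins for `games` and `predict_games` (values are made up).
games = pd.DataFrame({
    'game_id': [1, 2, 3, 4, 5, 6],
    'home_team_id': [1, 2, 3, 1, 2, 3],
    'away_team_id': [2, 3, 1, 3, 1, 2],
    'FTR': [1, 0, -1, 1, 0, -1],
})
predict_games = pd.DataFrame({
    'game_id': [6701, 6702],
    'home_team_id': [1, 3],
    'away_team_id': [2, 1],
})

ada = AdaBoostClassifier().fit(games[feature_names], games['FTR'])

# Copy the id column so that adding FTR does not write into a slice
# of predict_games.
testResults = predict_games[['game_id']].copy()
# predict() returns one label per input row, in row order, so the
# predictions line up with the game_id values.
testResults['FTR'] = ada.predict(predict_games[feature_names])
print(testResults)
```

A frame built this way has the two columns that the commented-out to_sql call in the question would write to the database.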
