How do I predict future results with scikit-learn and pandas in Python using RandomForestRegressor?

Hello, I came across this tutorial on using Python with a few libraries to predict NCAAB games via the sportsreference library. I will post the code as well as the article. It seems to work well, but I think it only tests against games that have already been played. How would I use it to predict future games between specific teams? For example, what will the score be between Team A and Team B on a given date?

The problem I see is that some of the data used can only be known after the game is finished, and that post-game data is what the program uses to predict the score.

First experiment: I tried filling in only the data that I knew about a game before it happened and filling in the remaining data with zeros using fillna(0). Here is what the CSV would look like:

date_team,away_assist_percentage,away_assists,away_block_percentage,away_blocks,away_defensive_rating,away_defensive_rebound_percentage,away_defensive_rebounds,away_effective_field_goal_percentage,away_field_goal_attempts,away_field_goal_percentage,away_field_goals,away_free_throw_attempt_rate,away_free_throw_attempts,away_free_throw_percentage,away_free_throws,away_losses,away_minutes_played,away_offensive_rating,away_offensive_rebound_percentage,away_offensive_rebounds,away_personal_fouls,away_points,away_steal_percentage,away_steals,away_three_point_attempt_rate,away_three_point_field_goal_attempts,away_three_point_field_goal_percentage,away_three_point_field_goals,away_total_rebound_percentage,away_total_rebounds,away_true_shooting_percentage,away_turnover_percentage,away_turnovers,away_two_point_field_goal_attempts,away_two_point_field_goal_percentage,away_two_point_field_goals,away_win_percentage,away_wins,home_assist_percentage,home_assists,home_block_percentage,home_blocks,home_defensive_rating,home_defensive_rebound_percentage,home_defensive_rebounds,home_effective_field_goal_percentage,home_field_goal_attempts,home_field_goal_percentage,home_field_goals,home_free_throw_attempt_rate,home_free_throw_attempts,home_free_throw_percentage,home_free_throws,home_losses,home_minutes_played,home_offensive_rating,home_offensive_rebound_percentage,home_offensive_rebounds,home_personal_fouls,home_points,home_steal_percentage,home_steals,home_three_point_attempt_rate,home_three_point_field_goal_attempts,home_three_point_field_goal_percentage,home_three_point_field_goals,home_total_rebound_percentage,home_total_rebounds,home_true_shooting_percentage,home_turnover_percentage,home_turnovers,home_two_point_field_goal_attempts,home_two_point_field_goal_percentage,home_two_point_field_goals,home_win_percentage,home_wins,pace
0,0,0,0,0,0,0,0,0,59,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.7,7,0,0,0,0,0,0,0,0,0,0,42,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,.1,1,0

The final line of code is changed to:

print(model.predict(final_trim).astype(int), y_test)

Here "final_trim" is the data from the new CSV that is being predicted.

The results were not accurate at all. What am I missing?

Here is the original code:

import pandas as pd
from sportsreference.ncaab.teams import Teams
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

FIELDS_TO_DROP = ['away_points', 'home_points', 'date', 'location',
                  'losing_abbr', 'losing_name', 'winner', 'winning_abbr',
                  'winning_name', 'home_ranking', 'away_ranking']

dataset = pd.DataFrame()
teams = Teams()
for team in teams:
    dataset = pd.concat([dataset, team.schedule.dataframe_extended])
X = dataset.drop(FIELDS_TO_DROP, axis=1).dropna().drop_duplicates()
y = dataset[['home_points', 'away_points']].values
X_train, X_test, y_train, y_test = train_test_split(X, y)
parameters = {'bootstrap': False,
              'min_samples_leaf': 3,
              'n_estimators': 50,
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6}
model = RandomForestRegressor(**parameters)
model.fit(X_train, y_train)
print(model.predict(X_test).astype(int), y_test)

And here is the post I got it from: https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894

Thank you!

Think of it this way: to test the goodness of fit of your model, you must know the result in advance, so that you can measure the distance between the model's output and the real outcome and tune the model to improve its overall performance.
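As a rough sketch of what "measuring the distance" can look like, the snippet below holds out part of the data and reports the mean absolute error for each predicted score. The data is synthetic and generated only for illustration (the feature values and score formulas are assumptions, not real game statistics).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not real game stats).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
# Two targets that depend on the features, plus a little noise.
y = np.column_stack([
    60 + 20 * X[:, 0] + rng.normal(0, 2, 200),
    55 + 15 * X[:, 1] + rng.normal(0, 2, 200),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# The test rows were not used for fitting, so their known outcomes can be
# compared against the model's predictions.
predictions = model.predict(X_test)
mae_per_target = mean_absolute_error(y_test, predictions, multioutput="raw_values")
print("MAE per target:", mae_per_target)
```

A lower error on the held-out rows suggests a better fit. This only works because the real outcomes of those rows are already known.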

Once you have trained your model, if you want to predict future values then (without knowing much about your specific data) you should feed the model the same features it was trained with, but holding the values for the case you want to predict. Here is a very basic example using two variables to predict the scores of two teams (A and B):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = {'Temperature': [10, 20, 30, 25], 'Humidity': [40, 50, 80, 65],
        'Score_A': [1, 2, 3, 2], 'Score_B': [6, 3, 1, 2]}
df = pd.DataFrame(data)
print(df)
X = df[['Temperature', 'Humidity']]   # features
Y = df[['Score_A', 'Score_B']]        # targets: one column per team
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

Here I've trained my model, so if I want to make a future prediction, I need to pass the same features I used in training (Temperature and Humidity) but with the values I want to make my prediction on. Let's say our friend the meteorologist says that the temperature and humidity for their next match will be 35 and 70 respectively. So I need to use .predict() with those values:

print(model.predict([[35, 70]]))

Which returns an output of:

[[2.6 1.4]]
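A small variation, assuming a recent scikit-learn version: a model fitted on a DataFrame remembers its column names, and predicting from a plain list triggers a warning about missing feature names. Passing a one-row DataFrame with the same columns avoids that and makes the "same features" rule explicit. In this sketch the model is fitted on all four rows rather than on a train split, so its numbers may differ from the output above; `features` and `next_match` are names chosen here for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "Temperature": [10, 20, 30, 25],
    "Humidity": [40, 50, 80, 65],
    "Score_A": [1, 2, 3, 2],
    "Score_B": [6, 3, 1, 2],
})
features = ["Temperature", "Humidity"]
model = RandomForestRegressor(random_state=42)
model.fit(df[features], df[["Score_A", "Score_B"]])

# One row for the upcoming match, with the same column names and order
# as the frame used for training.
next_match = pd.DataFrame([{"Temperature": 35, "Humidity": 70}])[features]
prediction = model.predict(next_match)
print(prediction)
```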

If you wish to make it fancier:

prediction = model.predict([[35,70]])
print("Team A will score: ",prediction[0][0])
print("Team B will score: ",prediction[0][1])

Returning:

Team A will score:  2.6
Team B will score:  1.4
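Applied to the basketball question, the same rule means every feature column has to be something you can know before the game starts. Filling post-game box-score columns with zeros gives the model inputs unlike anything it was trained on. One possible approach (a sketch under assumptions, not the tutorial's method) is to train on each team's averages from its earlier games and, for an upcoming game, feed in the averages over all games played so far. Everything below is hypothetical: the four teams, their scores, the column names, and the helpers `prior_mean` and `upcoming_features` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic results for four hypothetical teams (illustration only).
rng = np.random.default_rng(42)
teams = ["A", "B", "C", "D"]
strength = {"A": 80, "B": 74, "C": 68, "D": 62}
pairs = [(h, a) for h in teams for a in teams if h != a] * 5
games = pd.DataFrame([
    {"game_id": i, "home": h, "away": a,
     "home_points": int(strength[h] + 3 + rng.normal(0, 5)),
     "away_points": int(strength[a] + rng.normal(0, 5))}
    for i, (h, a) in enumerate(pairs)
])

# One row per team per game: points scored and points allowed.
home_view = games.rename(columns={"home": "team", "home_points": "scored",
                                  "away_points": "allowed"})
away_view = games.rename(columns={"away": "team", "away_points": "scored",
                                  "home_points": "allowed"})
long = (pd.concat([home_view, away_view])[["game_id", "team", "scored", "allowed"]]
        .sort_values("game_id")
        .reset_index(drop=True))

def prior_mean(s):
    # shift(1) excludes the current game, so each value uses earlier games only.
    return s.shift(1).expanding().mean()

long["avg_scored"] = long.groupby("team")["scored"].transform(prior_mean)
long["avg_allowed"] = long.groupby("team")["allowed"].transform(prior_mean)

# Attach the pre-game averages of the home and away team to every game.
feats = long[["game_id", "team", "avg_scored", "avg_allowed"]]
train = (
    games
    .merge(feats.add_prefix("home_"), left_on=["game_id", "home"],
           right_on=["home_game_id", "home_team"])
    .merge(feats.add_prefix("away_"), left_on=["game_id", "away"],
           right_on=["away_game_id", "away_team"])
    .dropna()  # a team's first game has no history yet
)

FEATURES = ["home_avg_scored", "home_avg_allowed",
            "away_avg_scored", "away_avg_allowed"]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[FEATURES], train[["home_points", "away_points"]])

# For a game that has not been played, use averages over all games so far.
current = long.groupby("team")[["scored", "allowed"]].mean()

def upcoming_features(home, away):
    return pd.DataFrame([{
        "home_avg_scored": current.loc[home, "scored"],
        "home_avg_allowed": current.loc[home, "allowed"],
        "away_avg_scored": current.loc[away, "scored"],
        "away_avg_allowed": current.loc[away, "allowed"],
    }])[FEATURES]

prediction = model.predict(upcoming_features("A", "D"))
print("Predicted home/away points:", prediction[0].round(1))
```

With real data the same idea would apply to whichever statistics you keep: compute them from past games only, retrain on those columns, and build the upcoming game's row the same way.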
