I am using the KNeighborsClassifier class from scikit-learn to predict football outcomes from team names. My training data has 18 statistics per match, such as goals and number of fouls, but for my test set I can only use the 2 team names.
The problem is that the number of training features must match the number of test features, or else I get:
ValueError: query data dimension must match training data dimension
How can I overcome this, while keeping my training set with past statistics and test set with only the team names?
Code
import numpy as np
import pandas as pd
from sklearn import neighbors

df = pd.read_csv('2013p.csv')      # training data
dftest = pd.read_csv('2014p.csv')  # test data
X = np.array(df.drop(columns=['FTR', 'BbAvH', 'BbAvD', 'BbAvA']))  # features (team names + match statistics)
y = np.array(df['FTR'])  # labels (classes)
# Test features: everything except the two team names is dropped
Xtest = np.array(dftest.drop(columns=['FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'BbAvH', 'BbAvD', 'BbAvA']))
ytest = np.array(dftest['FTR'])  # labels (classes)
clf = neighbors.KNeighborsClassifier(n_neighbors=19)  # new classifier
clf.fit(X, y)  # fit on training data
results = clf.predict(Xtest)  # raises the ValueError: Xtest has fewer columns than X
Data
HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HS, AS, HST, AST, HF, AF
401, 301, 2 , 3 , -1 , 1 , 1 , 5 , 7 , 3 , 5 , 2 , 4
Interesting question and I am not an expert but I think you would need to do one of the following
The shape of the training set must be the same as the shape of the test set, so use SelectKBest to transform it.
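A minimal sketch of that suggestion, using synthetic data rather than the question's CSV files (the array sizes, `k=2`, and `n_neighbors=5` are arbitrary assumptions). Note that SelectKBest selects columns by position, so the test set must still contain the same columns as the training set; on its own it does not help when the test set holds only the team names.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 6 numeric features for both train and test.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 6))
y_train = np.tile([-1, 0, 1], 20)  # hypothetical labels: away win, draw, home win
X_test = rng.normal(size=(10, 6))

# Fit the selector on the training data only, then apply the same
# column selection to the test data so both end up with k columns.
selector = SelectKBest(score_func=f_classif, k=2)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_sel, y_train)
pred = clf.predict(X_test_sel)
print(X_train_sel.shape, X_test_sel.shape, pred.shape)
```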
It seems that the features you have are not applicable to the problem you are trying to solve (predicting football match outcomes).
This is what is known as data leakage: the leakage of information from the future into the past.
The following features are not available for matches that have not played out yet:
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR = Full Time Result
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HS = Home Team Shots
AS = Away Team Shots
HST = Home Team Shots on Target
AST = Away Team Shots on Target
HF = Home Team Fouls Committed
AF = Away Team Fouls Committed
Training a model using them would yield unrealistically good predictions. This is because the model will basically learn that when FTHG > FTAG then the outcome is Home victory.
The problem is that the full-time home goals are known only after the event has finished, so they give you no predictive power for matches that have not been played yet.
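To illustrate the leak, here is a small sketch with made-up scores. It assumes FTR is encoded as 1 for a home win, 0 for a draw and -1 for an away win, which is what the sample row in the question suggests (2 home goals, 3 away goals, FTR of -1).

```python
import numpy as np

# Made-up full-time scores and results (assumed encoding: 1 / 0 / -1).
fthg = np.array([2, 1, 3, 0])
ftag = np.array([3, 1, 0, 2])
ftr = np.array([-1, 0, 1, -1])

# The label is fully determined by the two goal columns,
# so a model given FTHG and FTAG has nothing left to predict.
reconstructed = np.sign(fthg - ftag)
print((reconstructed == ftr).all())
```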
You need to rework your features to only use information that is available for an event before it actually starts.
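One possible way to do this, sketched with a tiny made-up season: summarise each team's past statistics into per-team averages, then look those averages up by team name for both past and upcoming matches. The feature names `home_avg_scored` and `away_avg_scored` and the helper `add_prematch_features` are hypothetical, as is the choice of averages. A stricter version would compute each training row's averages only from matches played before that row's date.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up past season (column names follow the question's data).
past = pd.DataFrame({
    "HomeTeam": [401, 301, 401, 302, 301, 302],
    "AwayTeam": [301, 401, 302, 401, 302, 301],
    "FTHG": [2, 1, 3, 0, 2, 1],
    "FTAG": [3, 1, 0, 2, 2, 0],
    "FTR": [-1, 0, 1, -1, 0, 1],
})

# Per-team averages computed from matches that have already been played.
home_form = past.groupby("HomeTeam")["FTHG"].mean()
away_form = past.groupby("AwayTeam")["FTAG"].mean()

def add_prematch_features(fixtures):
    """Hypothetical helper: map team names to their historical averages."""
    return pd.DataFrame({
        "home_avg_scored": fixtures["HomeTeam"].map(home_form),
        "away_avg_scored": fixtures["AwayTeam"].map(away_form),
    })

# Train and predict on the same two columns, so the dimensions match.
X_train = add_prematch_features(past)
y_train = past["FTR"]

upcoming = pd.DataFrame({"HomeTeam": [401, 302], "AwayTeam": [302, 301]})
X_new = add_prematch_features(upcoming)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
pred = clf.predict(X_new)
print(pred)
```

With this layout the test set needs only the two team names, because every other feature is derived from past results.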
I have developed a football prediction API, which I offer for free, and I had to select the features carefully in order to avoid data leakage.
For example some good features would be:
These kinds of features can be known ahead of the actual event.