
Predicting test data with fewer features than training set

I am using Scikit-Learn's KNeighborsClassifier to predict football outcomes from team names. My training data has 18 statistics per match, such as goals and number of fouls, but for my test set I can only use the 2 team names.
The problem is that the number of training features must match the number of test features; otherwise I get

ValueError: query data dimension must match training data dimension

How can I overcome this, while keeping my training set with past statistics and test set with only the team names?

Code

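import numpy as np
import pandas as pd
from sklearn import neighbors

df = pd.read_csv('2013p.csv')      # Training data
dftest = pd.read_csv('2014p.csv')  # Test data

# Training features: every column except the result and the betting odds
X = np.array(df.drop(['FTR','BbAvH','BbAvD','BbAvA'], axis=1))
y = np.array(df['FTR'])  # Labels (full-time result)

# Test features: drop every match statistic, leaving the team names
Xtest = np.array(dftest.drop(['FTHG','FTAG','FTR','HTHG','HTAG','HS','AS','HST','AST','HF','AF','HC','AC','HY','AY','HR','AR','BbAvH','BbAvD','BbAvA'], axis=1))
ytest = np.array(dftest['FTR'])  # Labels (full-time result)

clf = neighbors.KNeighborsClassifier(n_neighbors=19)  # New classifier
clf.fit(X, y)  # Fit on training data

results = clf.predict(Xtest)  # Raises the ValueError: X and Xtest have different widths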

Data
HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HS, AS, HST, AST, HF, AF
401, 301, 2 , 3 , -1 , 1 , 1 , 5 , 7 , 3 , 5 , 2 , 4

Interesting question. I am not an expert, but I think you would need to do one of the following:

  1. Estimate (impute) the missing features from the features that are present.
  2. Average the prediction over all possible values the missing features might take.
    • Perhaps assume a likely range for each feature and then run the prediction with every value in that range.
    • This leads to a combinatorial explosion in the number of combinations you have to evaluate. Whether you can live with that depends on how many missing features you want to handle and how long each prediction takes.
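
A minimal sketch of option 1, assuming the simplest possible estimate: each missing statistic in the test rows is filled with that column's training mean, so both matrices end up with the same width. The toy frames, their values, and the names `feature_cols` and `test_filled` are made up for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the training file: team codes plus post-match statistics.
train = pd.DataFrame({
    "HomeTeam": [401, 301, 401, 205, 301, 205],
    "AwayTeam": [301, 401, 205, 401, 205, 301],
    "HS":       [5, 12, 9, 7, 14, 6],   # home team shots
    "AS":       [7, 4, 8, 10, 3, 9],    # away team shots
    "FTR":      [-1, 1, 0, -1, 1, 0],   # full-time result (label)
})

# Upcoming matches: only the team codes are known.
test = pd.DataFrame({"HomeTeam": [401, 205], "AwayTeam": [205, 301]})

feature_cols = [c for c in train.columns if c != "FTR"]

# Add the missing columns as NaN, then fill them with the training means.
test_filled = test.reindex(columns=feature_cols).fillna(train[feature_cols].mean())

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train[feature_cols].to_numpy(), train["FTR"].to_numpy())

# Same number of columns as the training matrix, so no ValueError here.
predictions = clf.predict(test_filled.to_numpy())
print(test_filled)
print(predictions)
```

This removes the dimension error, but every test row carries the same filled-in statistics, so the nearest neighbours are decided by the team codes plus how close each training match is to the average match. Whether that is useful for prediction is a separate question.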

The training set must have the same shape as the test set, so use SelectKBest to transform it.

It seems that the features you have are not applicable to the problem you are trying to solve (predicting football match outcomes).

This is known as data leakage: information from the future leaking into the past.

The following features are not available for matches that have not played out yet:

FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR  = Full Time Result 
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HS   = Home Team Shots
AS   = Away Team Shots
HST  = Home Team Shots on Target
AST  = Away Team Shots on Target
HF   = Home Team Fouls Committed
AF   = Away Team Fouls Committed

Training a model on them would yield unrealistically good predictions, because the model will basically learn that when FTHG > FTAG the outcome is a home victory.

The problem is that the full-time home goals are known only after the event has finished, so the model has no predictive power for matches that have not been played.

You need to rework your features to only use information that is available for an event before it actually starts.
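
As a rough illustration of that point, the sketch below trains only on columns that exist before kickoff (here just the two team codes) and uses the same column list for training and prediction, which also keeps the dimensions equal. The toy data and the name `pre_match_cols` are assumptions for the example. One-hot encoding is a choice made here, not something from the question: it avoids treating the numeric team codes as if nearby numbers meant similar teams.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy past season: post-match columns exist but will not be used as features.
train = pd.DataFrame({
    "HomeTeam": [401, 301, 401, 205, 301, 205],
    "AwayTeam": [301, 401, 205, 401, 205, 301],
    "FTHG":     [2, 3, 1, 0, 2, 1],
    "FTAG":     [3, 1, 1, 2, 0, 1],
    "FTR":      [-1, 1, 0, -1, 1, 0],
})

# Toy upcoming fixtures: only the team codes are available.
test = pd.DataFrame({"HomeTeam": [401, 205], "AwayTeam": [205, 301]})

# Hypothetical list of columns known before the match starts.
pre_match_cols = ["HomeTeam", "AwayTeam"]

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # team codes are labels, not magnitudes
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(train[pre_match_cols], train["FTR"])

# The same columns are used at prediction time, so the widths always match.
predictions = model.predict(test[pre_match_cols])
print(predictions)
```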

I have developed a football prediction API, which I offer for free, and I had to select the features carefully to avoid data leakage.

For example, some good features would be:

  • Percent of games won when playing at home vs teams similar to the current adversary.
  • Average goals scored in the last X games.
  • Average goals conceded in the last X games.

These kinds of features can be known ahead of the actual event.
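
The two rolling averages above could be computed along these lines. This is a sketch on made-up data: the long-format layout (one row per team per match), the column names, and the window size `X = 2` are all assumptions. The important detail is `shift(1)`, which excludes the current match from its own average so that nothing from the match being predicted leaks into its features.

```python
import pandas as pd

# Toy history: one row per team per match, with a chronological match number.
history = pd.DataFrame({
    "team":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "match_no": [1, 2, 3, 4, 1, 2, 3, 4],
    "scored":   [2, 0, 1, 3, 1, 1, 0, 2],
    "conceded": [1, 1, 0, 2, 2, 0, 3, 1],
})

X = 2  # number of previous games to average over (hypothetical choice)

history = history.sort_values(["team", "match_no"]).reset_index(drop=True)

def previous_games_mean(series):
    # shift(1) drops the current match; rolling(X) then averages the X games before it.
    return series.shift(1).rolling(X).mean()

history["avg_scored_last_x"] = (
    history.groupby("team")["scored"].transform(previous_games_mean)
)
history["avg_conceded_last_x"] = (
    history.groupby("team")["conceded"].transform(previous_games_mean)
)

print(history)
```

The first X matches of each team have no full window and come out as NaN; those rows need to be dropped or filled before fitting a classifier.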
