Inconsistent number of samples K Nearest Neighbor sklearn

Question

I'm doing some self-training with material from guidetodatamining.com and am working on K Nearest Neighbor using sklearn. I am getting the error: ValueError: Found input variables with inconsistent numbers of samples: [2, 20]

When I run this code:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
cols= ['Name', 'Sport', 'Height', 'Weight']
df = pd.read_table("https://raw.githubusercontent.com/zacharski/pg2dm-python/master/data/ch4/athletesTrainingSet.txt",  names = cols, index_col='Name')
df = df[1:]
df = df[ ['Height', 'Weight','Sport'] ]
knn = KNeighborsClassifier(n_neighbors=2)
X= df.Height, df.Weight
y = df.Sport
knn.fit(X, y)
knn.predict(X)

In the dataset there are 20 values in each of the three columns, so I have no idea what's happening. I am trying to use the Height and Weight fields to train on the Sport field, so that if you put some data in, it "recommends" what sport a person would play. I know there are several similar topics about the LinearRegression tool, but I can't get any of the solutions from those to work for me. I have tried reshaping my data, and I have tried using just height or weight, but that gives me an error about a 1D array instead of a 2D array.

Even just a helpful nudge in the right direction would be incredibly helpful as I have been staring at this for 2 days now with no solution. Thank you.

Answer 1

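Your problem is in how you create X and y. X = df.Height, df.Weight builds a tuple of two pandas Series, which sklearn reads as 2 samples, while y is a single Series with 20 samples; that is where the [2, 20] in the error comes from. Selecting the columns as DataFrames solves the problem. Running your code line by line and checking the shapes helps locate this kind of issue.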

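# Select both feature columns as one DataFrame with one row per sample
x = df[["Height", "Weight"]]
# Select the target column as a one-column DataFrame
y = df[["Sport"]]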
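
To see where the mismatch comes from, it can help to compare shapes directly. The sketch below uses a small made-up table (the values are illustrative and are not taken from the athletes file), and the names as_tuple, as_frame, and single are only for this example. It also covers the single-feature case mentioned in the question, where double brackets keep the data 2D.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration; not taken from the athletes file.
df = pd.DataFrame({
    "Height": [54, 66, 78, 70, 62, 75],
    "Weight": [66, 140, 204, 150, 98, 200],
    "Sport": ["Gymnastics", "Track", "Basketball",
              "Track", "Gymnastics", "Basketball"],
})

# A tuple of two Series becomes an array with one row per Series,
# so it looks like 2 samples with 6 features each.
as_tuple = np.asarray((df.Height, df.Weight))
print(as_tuple.shape)  # (2, 6)

# Double brackets return a DataFrame with one row per sample.
as_frame = df[["Height", "Weight"]]
print(as_frame.shape)  # (6, 2)

# A single feature stays 2D when selected with double brackets,
# whereas single brackets return a 1D Series.
single = df[["Height"]]
print(single.shape)        # (6, 1)
print(df["Height"].shape)  # (6,)
```

The first shape is the source of the "2" in the error message: the number of rows in X has to match the number of entries in y.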

You can also try splitting your dataset into two sets. The model is trained on one set and validated on the other.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
knn.fit(X_train, y_train)
knn.predict(X_test)

You can save your target value and prediction into a dataframe and check them.

comp_results = pd.concat(
    [y_test, pd.DataFrame(data=knn.predict(X_test), index=y_test.index)],
    axis=1,
).rename(columns={"Sport": "Target", 0: "Prediction"})
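
For reference, here is a minimal end-to-end sketch of the split, fit, and comparison steps on a self-contained table. The data is made up for illustration, random_state=0 is an assumption added only to make the split repeatable, and accuracy_score is an extra check that is not part of the original answer. The target is passed as a 1D Series here, since a one-column DataFrame may trigger a column-vector warning in sklearn.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Made-up sample data for illustration; not taken from the athletes file.
df = pd.DataFrame({
    "Height": [54, 55, 56, 66, 67, 68, 76, 78, 79, 80],
    "Weight": [66, 70, 72, 130, 135, 140, 200, 204, 210, 215],
    "Sport": ["Gymnastics"] * 3 + ["Track"] * 3 + ["Basketball"] * 4,
})

X = df[["Height", "Weight"]]  # 2D features: one row per athlete
y = df["Sport"]               # 1D target: one label per athlete

# Hold out 20% of the rows for validation; random_state is an assumption
# used here only so the split is the same on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Put the true labels and the predictions side by side.
comp_results = pd.DataFrame(
    {"Target": y_test, "Prediction": predictions}, index=y_test.index
)
print(comp_results)
print("Accuracy:", accuracy_score(y_test, predictions))
```

With only a handful of rows the accuracy figure is not meaningful; the point is only that X and y now have matching numbers of samples at every step.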

solution1
0 ACCEPTED 2018-10-11 21:55:47
