
The number of data points being classified and plotted does not match the number of points in the dataset

I am using a dataset that has 54 data points to be classified in Python using a k-NN classifier with the number of neighbours set to 20. My code does the classification and plots the results, but I only see 22 data points in my scatter plot, not the 54 data points being classified.

Is there a reason in machine learning why all data points aren't being classified and plotted?

Does the number of neighbours chosen affect the number of data points being classified and plotted? Thanks.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
import pandas as pd
from sklearn import preprocessing

# Preprocessing of dataset done here.
n_neighbors = 20
dataset = pd.read_csv('cereal.csv')
X = dataset.iloc[:, [3,5]].values
y = dataset.iloc[:, 1].values
y_set = preprocessing.LabelEncoder()
y_fit = y_set.fit(y)
y_trans = y_set.transform(y)

# Sorting of the dataset done here. Total number of data points: 77, but 54
# will be selected for use.
j = 0
for i in range(0, 77):
    if y[i] == 'K' or y[i] == 'G' or y[i] == 'P':
        j = j + 1

new_data = np.zeros((j, 2))
new_let = [0] * j
j = 0

for i in range(0, 77):
    if y[i] == 'K' or y[i] == 'G' or y[i] == 'P':
        new_data[j] = X[i]
        new_let[j] = y[i]
        j = j + 1

# Plotting and setting up mesh grid done here

h = .02
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y_trans)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    plt.scatter(X[:, 0], X[:, 1], c=y_trans, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()

First of all, you're using all 77 points of your dataset in your classifier and in your plot. The variable you created with 54 points in it is not used either to fit the classifier or to produce the resulting plot.
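
For illustration, here is a minimal sketch of how the 54-point subset could be fitted and plotted instead, assuming new_data and new_let have already been built by the filtering loops in your code (the names sub_encoder, new_let_trans and clf_sub are introduced here purely for illustration):

# Hypothetical sketch: fit and plot using only the 54 selected points.
sub_encoder = preprocessing.LabelEncoder()
new_let_trans = sub_encoder.fit_transform(new_let)    # 'G'/'K'/'P' -> 0/1/2
clf_sub = neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')
clf_sub.fit(new_data, new_let_trans)                   # fit on 54 points, not 77
plt.scatter(new_data[:, 0], new_data[:, 1], c=new_let_trans,
            cmap=cmap_bold, edgecolor='k', s=20)       # plot only the 54 points
plt.show()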

You should check the Variable Explorer in Anaconda after running the script to see the sizes of the different variables you're using.

As to the plot you're generating, if you look at the way the data is distributed you will see why you see only 22 points:

[Figure: cereal k-NN decision-boundary scatter plot]

If you look at the original dataset, there are several points that share duplicate values in these two columns (fat and calories). As a result, several points are stacked on top of one another on the plot, so although you are plotting 77 points, you only "see" 22 of them on your plot. You might want to pick some other attribute if you want to see them all separated nicely.
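
As a quick check (just a sketch, assuming the same cereal.csv and the two columns selected in your code), you can count the distinct (calories, fat) pairs that actually end up on the plot:

# Hypothetical check: count distinct value pairs among the two plotted columns.
unique_points = np.unique(X, axis=0)
print(X.shape[0], "rows, but only", unique_points.shape[0], "distinct points to plot")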
