
Train-test split does not seem to work properly in Python?

I am trying to run a kNN (k-nearest neighbour) algorithm in Python.

The dataset I am using is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/wine

Here is the code I am using:

#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print os.getcwd() # Prints the working directory
os.chdir('C:\\file_path') # Provide the path here

#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']

#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()

#4. DATA FRAME
d = {'alcoholscaled' : pd.Series([alcoholscaled]),
'malicscaled' : pd.Series([malicscaled]),
'ashscaled' : pd.Series([ashscaled]),
'ash_alcalinity_scaled' : pd.Series([ash_alcalinity_scaled]),
'magnesiumscaled' : pd.Series([magnesiumscaled]),
'phenolsscaled' : pd.Series([phenolsscaled]),
'flavanoidsscaled' : pd.Series([flavanoidsscaled]),
'nonflavanoidsscaled' : pd.Series([nonflavanoidsscaled]),
'proanthocyaninsscaled' : pd.Series([proanthocyaninsscaled]),
'color_intensity_scaled' : pd.Series([color_intensity_scaled]),
'hue_scaled' : pd.Series([huescaled]),
'od280scaled' : pd.Series([od280scaled]),
'prolinescaled' : pd.Series([prolinescaled])}
df = pd.DataFrame(d)

#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

In section 5, after importing train_test_split from sklearn.model_selection, the split does not appear to run correctly, because the printed shapes are: (0,13) (0,178) (1,13) (1,178).

Then, when I try to run the kNN, I get the error message: Found array with 0 sample(s) (shape=(0,13)) while a minimum of 1 is required. This is not due to the max-min normalisation, as I still get this error message even when the variables are not scaled.
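
The shapes reported above are consistent with the data frame from section 4 holding a single row, and with np.matrix turning the label column into one row of 178 values. The sketch below reproduces that pattern on made-up values; the two short series are placeholders for illustration, not the wine data, and whether this is the actual cause in the original script is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for two columns of the question's data (not the wine values).
alcohol = pd.Series([12.0, 13.5, 14.1, 12.8])
winery = pd.Series([1, 1, 2, 3])

# Wrapping a Series in a list gives a one-element Series whose single
# element is the whole column, so the resulting frame has one row.
wrapped = pd.DataFrame({'alcoholscaled': pd.Series([alcohol])})

# Passing the Series itself keeps one row per sample.
direct = pd.DataFrame({'alcoholscaled': alcohol})

# np.matrix turns a 1-D input into a single row of n columns,
# while to_numpy() keeps the labels as a 1-D array of n entries.
labels_as_matrix = np.matrix(winery)
labels_as_array = winery.to_numpy()

print(wrapped.shape, direct.shape)
print(labels_as_matrix.shape, labels_as_array.shape)
```

With only one row in X, a 30% test split leaves nothing for training, which would match the (0,13) shape in the error message. If this is the cause, building the frame from the Series directly and passing winery as a 1-D array should give 178 rows to split.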

I'm not exactly sure where your code is going wrong; it goes about things slightly differently from the sklearn docs. However, I can show you a different way of getting the train-test split to work on the wine dataset.

from sklearn.datasets import load_wine
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
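
To mirror the shape check and test score from the question, the same approach can be extended as sketched below. This is a variation rather than the code above: it wraps the scaler and classifier in a pipeline so the min/max values come from the training rows only, and the random_state and stratify arguments are illustrative choices that do not appear in the original post.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

# random_state and stratify are illustrative choices, not from the original post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# The pipeline fits the scaler on the training rows only, then reuses
# those min/max values when scoring the test rows.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=10))
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
print("Test set score: {:.2f}".format(score))
```

Both training shapes should share the same first dimension, as should both test shapes; a first dimension of 0 or 1 would point back to how X and y were built.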

