
train_test_split not splitting data

I have a dataframe with 14 columns in total. The last column is the target label, with integer values 0 or 1.

I have defined:

  1. X = df.iloc[:,0:13] ---- this consists of the feature values
  2. y = df.iloc[:,-1] ------ this consists of the corresponding labels

Both have the same length, as desired. X is a dataframe with 13 columns, shape (159880, 13), and y is array-like with shape (159880,).
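As a side note on the slicing above: iloc slices exclude the stop index, so the number of feature columns depends on both bounds. The following is only a minimal sketch on a made-up 5-row, 14-column frame (the real data is not reproduced here), comparing a slice that starts at 0 with one that starts at 1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 5 rows, 14 columns,
# with the last column playing the role of the label.
df = pd.DataFrame(np.arange(5 * 14).reshape(5, 14))

X_from_0 = df.iloc[:, 0:13]  # columns 0..12 -> 13 feature columns
X_from_1 = df.iloc[:, 1:13]  # columns 1..12 -> 12 feature columns
y = df.iloc[:, -1]           # last column, returned as a Series

print(X_from_0.shape, X_from_1.shape, y.shape)  # (5, 13) (5, 12) (5,)
```

Only the slice starting at 0 gives the 13 columns reported for X.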

But when I run train_test_split on X and y, the function does not seem to work properly.

Below is the straightforward code:

X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)

After this split, both X_train and X_test have shape (119910, 13). y_train has shape (39970, 13) and y_test has shape (39970,).

This is weird: even after setting the test_size parameter, the results stay the same.

Please advise what could be going wrong.
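For what it is worth, the row counts reported above are what a 25% test split of 159880 rows would produce. The check below is plain arithmetic rather than a call into scikit-learn. It assumes the default test_size of 0.25 (used when neither test_size nor train_size is given) and that the test count is rounded up, with the remainder going to the training set.

```python
import math

n_samples = 159880  # row count reported in the question
test_size = 0.25    # assumed default when test_size and train_size are both omitted

n_test = math.ceil(test_size * n_samples)  # rows in the test part
n_train = n_samples - n_test               # remaining rows go to training

print(n_train, n_test)  # 119910 39970
```

So the split sizes themselves look as expected; what looks off is which variable ends up holding which piece.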

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_feature_importances
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def model():
    # Load the data and keep only rows that have a compliance label
    df = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
    df = df[np.isfinite(df['compliance'])]
    df = df.fillna(0)
    df['compliance'] = df['compliance'].astype('int')
    # Drop columns that are not used as features
    df = df.drop(['grafitti_status', 'violation_street_number','violation_street_name','violator_name',
                  'inspector_name','mailing_address_str_name','mailing_address_str_number','payment_status',
                  'compliance_detail', 'collection_status','payment_date','disposition','violation_description',
                  'hearing_date','ticket_issued_date','city','state','country',
                  'agency_name','violation_code'], axis=1)
    # Replace known non-numeric codes with 0, then coerce the columns to numbers
    df['violation_zip_code'] = df['violation_zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['zip_code'] = df['zip_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['non_us_str_code'] = df['non_us_str_code'].replace(['ONTARIO, Canada',', Australia','M3C1L-7000'], 0)
    df['violation_zip_code'] = pd.to_numeric(df['violation_zip_code'], errors='coerce')
    df['zip_code'] = pd.to_numeric(df['zip_code'], errors='coerce')
    df['non_us_str_code'] = pd.to_numeric(df['non_us_str_code'], errors='coerce')
    #df.violation_zip_code = df.violation_zip_code.replace('-','', inplace=True)
    # Turn any NaN left over from the coercion into 0
    df['violation_zip_code'] = np.nan_to_num(df['violation_zip_code'])
    df['zip_code'] = np.nan_to_num(df['zip_code'])
    df['non_us_str_code'] = np.nan_to_num(df['non_us_str_code'])
    # First 13 columns are the features, last column is the label
    X = df.iloc[:,0:13]
    y = df.iloc[:,-1]
    X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 0)
    print(y_train.shape)

You mixed up the order of the values returned by train_test_split. It should be:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
if args.mode == "train":

    # Load Data
    data, labels = load_dataset('C:/Users/PC/Desktop/train/k')

    # Train ML models
    knn(data, labels,'C:/Users/PC/Desktop/train/knn.pkl' )
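
To illustrate the point about return order: for each input array, train_test_split returns the training part first and then the test part, so two inputs yield X_train, X_test, y_train, y_test in that order. Below is a minimal sketch on synthetic data (the 1000-row frame and its random values are made up for illustration, not taken from the question), comparing the documented unpacking order with the one used in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 13 features, binary label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 13)))
y = pd.Series(rng.integers(0, 2, size=1000))

# Documented order: train part, then test part, for each input in turn.
good_names = ("X_train", "X_test", "y_train", "y_test")
parts = train_test_split(X, y, random_state=0)
correct = {name: part.shape for name, part in zip(good_names, parts)}

# Order used in the question: the same four objects, but under the wrong names.
bad_names = ("X_train", "y_train", "X_test", "y_test")
parts = train_test_split(X, y, random_state=0)
swapped = {name: part.shape for name, part in zip(bad_names, parts)}

print(correct)  # y_train is 1-D: (750,)
print(swapped)  # "y_train" is really the test features: (250, 13)
```

With the swapped names, the variable called y_train holds the test features, which is how a label array can appear to have 13 columns.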

