
How to split dataset into (X_train, y_train), (X_test, y_test)?

The training and validation datasets I am using are shared here for the sake of reproducibility.

validation_dataset.csv is the ground truth for training_dataset.csv.

What I am doing below is feeding the datasets into a simple CNN layer that extracts the useful features of the images and feeds them as 1D sequences into an LSTM network for classification.

from keras.models import Sequential
from keras.layers import Conv1D, Dense, Dropout, Flatten, LSTM, MaxPooling1D, TimeDistributed
from keras import optimizers
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from numpy import genfromtxt

df_train = genfromtxt('data/train/training_dataset.csv', delimiter=',') 
df_validation = genfromtxt('data/validation/validation_dataset.csv', delimiter=',') 

#train,test = train_test_split(df_train, test_size=0.20, random_state=0)


df_train = df_train[..., None] 
df_validation = df_validation[..., None]


batch_size=8
epochs=5
    
model = Sequential()

model.add(Conv1D(filters=5, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
#model.add(TimeDistributed(Flatten()))
model.add(LSTM(50, return_sequences=True, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(LSTM(10))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)

model.compile(optimizer="rmsprop", loss='mse', metrics=['accuracy'])
callbacks = [EarlyStopping('val_loss', patience=3)]


model.fit(df_train, df_validation, batch_size=batch_size)

print(model.summary())

   
scores = model.evaluate(df_train, df_validation, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

I want to split the training and validation datasets into (X_train, y_train), (X_test, y_test) so that I can use both datasets for training and testing. I tried scikit-learn's split function, train, test = train_test_split(df_train, test_size=0.20, random_state=0), but calling model.fit() afterwards raises the following error:

ValueError: Data cardinality is ambiguous:
  x sizes: 14384
  y sizes: 3596
Please provide data which shares the same first dimension.

How can we split the dataset into (X_train, y_train), (X_test, y_test) so that each pair shares the same first dimension?

One way is to separate the data into features (X) and labels (y) before splitting. Here, I assume the label column is named 'target'.

target = df_train['target']
df_train = df_train.drop(columns=['target'])

X_train, X_test, y_train, y_test = train_test_split(df_train, target, test_size=0.20, random_state=0)
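Note that selecting a column by name only works on a pandas DataFrame (for example one loaded with pd.read_csv); the plain NumPy array returned by genfromtxt in the question cannot be indexed that way. A minimal sketch of the 'target' approach, using synthetic data and made-up column names (f0 to f3 and target are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a table whose label column is named 'target'.
# The column names and values are assumptions for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f0", "f1", "f2", "f3"])
df["target"] = rng.integers(0, 2, size=100)

# Separate the labels from the features before splitting.
y = df["target"]
X = df.drop(columns=["target"])

# Passing X and y together keeps every feature row paired with its label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Because both objects go through the same call, X_train and y_train keep matching row indices, and likewise for the test pair.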

--

It seems that I initially misunderstood your problem, and "validation_dataset.csv" is your label data. I apologize for not reading it correctly.

In this case, you do not need a "target" variable, because df_validation already plays that role. Therefore, I think the following may work:

X_train, X_test, y_train, y_test = train_test_split(df_train, df_validation, test_size=0.20, random_state=0)
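The sizes in the error message (14384 and 3596) add up to 17980 and are in an 80/20 ratio, which is consistent with the two halves of a single-array split having been passed to model.fit() as x and y. Splitting the features and labels in one call avoids that. Here is a rough sketch with synthetic arrays standing in for the two CSV files; the shapes (1000 samples, 16 values each) and the variable names features and labels are assumptions, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for training_dataset.csv and validation_dataset.csv.
# Shapes are assumptions: 1000 samples of 16 values, one label per sample.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16))
labels = rng.integers(0, 2, size=1000).astype("float32")

# Passing both arrays makes scikit-learn apply the same shuffle to each,
# so row i of X_train still corresponds to entry i of y_train.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, random_state=0
)

# Add the trailing channel axis that Conv1D expects: (samples, steps, 1).
X_train = X_train[..., None]
X_test = X_test[..., None]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# With a compiled Keras model, the pairs would then be used like this:
# model.fit(X_train, y_train, batch_size=8, epochs=5,
#           validation_data=(X_test, y_test))
# scores = model.evaluate(X_test, y_test, verbose=0)
```

Evaluating on (X_test, y_test) rather than on the training pair gives a score on data the model has not seen during fitting.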

