How to split dataset into (X_train, y_train), (X_test, y_test)?
The training and validation datasets I am using are shared here for the sake of reproducibility. The validation_dataset.csv is the ground truth of training_dataset.csv.
What I am doing below is feeding the datasets into a simple CNN layer that extracts the useful features of the images and feeds them as a 1D sequence into an LSTM network for classification.
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.convolutional import Conv1D
from keras.layers import LSTM
from keras.layers.convolutional import MaxPooling1D
from keras.layers import TimeDistributed
from keras.layers import Dropout
from keras import optimizers
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from numpy import genfromtxt
df_train = genfromtxt('data/train/training_dataset.csv', delimiter=',')
df_validation = genfromtxt('data/validation/validation_dataset.csv', delimiter=',')
#train,test = train_test_split(df_train, test_size=0.20, random_state=0)
df_train = df_train[..., None]
df_validation = df_validation[..., None]
batch_size=8
epochs=5
model = Sequential()
model.add(Conv1D(filters=5, kernel_size=3, activation='relu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
#model.add(TimeDistributed(Flatten()))
model.add(LSTM(50, return_sequences=True, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(LSTM(10))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
adam = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
model.compile(optimizer="rmsprop", loss='mse', metrics=['accuracy'])
callbacks = [EarlyStopping('val_loss', patience=3)]
model.fit(df_train, df_validation, batch_size=batch_size)
print(model.summary())
scores = model.evaluate(df_train, df_validation, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
I want to split the training and validation datasets into (X_train, y_train), (X_test, y_test) so that I can use both for training and testing. I tried the split function of the scikit-learn library - train,test = train_test_split(df_train, test_size=0.20, random_state=0) - but it gives me the following error after model.fit() is invoked:
ValueError: Data cardinality is ambiguous:
x sizes: 14384
y sizes: 3596
Please provide data which shares the same first dimension.
How can we split the dataset into (X_train, y_train), (X_test, y_test) sharing the same first dimension?
One way is to have X and Y sets. Here, I assume the column name for Y is 'target'.
target = df_train['target']
df_train = df_train.drop(columns=['target'])
X_train, X_test, y_train, y_test = train_test_split(df_train, target, test_size=0.20, random_state=0)
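A minimal, self-contained sketch of this column-split approach, assuming df_train is a pandas DataFrame whose label column is named 'target' (the column name and the synthetic data below are illustrative assumptions, not your real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_train: 10 rows, two feature columns
# plus a 'target' label column (names are assumptions).
df_train = pd.DataFrame({
    'feature_a': range(10),
    'feature_b': range(10, 20),
    'target':    [0, 1] * 5,
})

# Separate the label column from the features before splitting.
target = df_train['target']
X = df_train.drop(columns=['target'])

# Split features and labels together so rows stay paired.
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.20, random_state=0)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Passing X and target in the same call keeps each label aligned with its feature row, which is what avoids the cardinality mismatch later.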
--
It seems that I had initially misunderstood your problem, and validation_dataset.csv is your label data. I apologize for not reading correctly.
In this case, you do not need a 'target' variable, as that is what df_validation would be. Therefore, I think the following may work:
X_train, X_test, y_train, y_test = train_test_split(df_train, df_validation, test_size=0.20, random_state=0)
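Since df_validation holds the labels, both arrays must share the same first dimension (one label per training row) before the split. A sketch with synthetic arrays standing in for the CSV contents; the sample count (100) and feature width (8) are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the loaded CSVs (shapes are assumptions).
df_train = np.random.rand(100, 8)    # features, one row per sample
df_validation = np.random.rand(100)  # one label per sample

# Splitting both in a single call keeps each label paired with its
# feature row, so x and y always share the same first dimension.
X_train, X_test, y_train, y_test = train_test_split(
    df_train, df_validation, test_size=0.20, random_state=0)

print(X_train.shape, y_train.shape)  # (80, 8) (80,)
print(X_test.shape, y_test.shape)    # (20, 8) (20,)

# These halves can then go to Keras, e.g.:
# model.fit(X_train[..., None], y_train, batch_size=8, epochs=5,
#           validation_data=(X_test[..., None], y_test))
```

The earlier ValueError (x sizes: 14384, y sizes: 3596) arose because model.fit() was given arrays whose first dimensions disagreed; splitting x and y together as above prevents that.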