Splitting test/train data for scikit-learn?

Question

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?

The starter code looks like this:

train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])

##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)

However, I need to change the acc_svc variable to use X_test and Y_test. X_test is given to us, but how do I come up with a Y_test? I know Y_test should correspond to the labels, and I'm getting size mismatches when I attempt to build it. This should be a simple question; would anyone mind pointing me in the right direction?

Answer 1

The test_preprocessed.csv file shouldn't be used to check your model's performance. Split your train_df into training and validation datasets using train_test_split() from scikit-learn. You then check your model's performance on the validation dataset, i.e. against the validation y. Please refer to the scikit-learn documentation.
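A minimal sketch of that split, for illustration only. The CSV files are not available here, so a small synthetic train_df with hypothetical columns (Pclass, Sex, Age) stands in for train_preprocessed.csv; the names X_val, y_val and acc_val are also assumptions, not part of the starter code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for train_preprocessed.csv (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)
noise = rng.random(n) < 0.2
train_df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": sex,
    "Age": rng.uniform(1, 80, size=n),
    # Label loosely tied to "Sex" so the model has something to learn.
    "Survived": np.where(noise, 1 - sex, sex),
})

# Separate predictors from the target.
X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# Hold out 25% of the labeled rows as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

svc = SVC()
svc.fit(X_train, y_train)

# Score on the held-out rows, which have known labels.
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("svm validation accuracy is:", acc_val)
```

The unlabeled test_preprocessed.csv would then only be passed to svc.predict() to produce the final predictions.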

Answer 2

First of all, you have to understand and clarify your target variable. Your "Y_test" seems to be your existing "Y_pred" variable, which corresponds to the "Survived" label (in your test set). However, although you drop "Survived" from "X_train" so that you can use it as the target, you don't do the same for "X_test", where you drop "PassengerId" instead.
Another basic concept here is that your dataset is already split into train and test subsets (your CSV files). I assume that your test set already has one column fewer than the train set, and that the missing column is "Survived", the variable to be predicted as a continuation of the train CSV file. Otherwise, you should drop it from the test features to avoid a mismatch and keep it as your test target variable. You don't have to come up with a "Y_test": the call "Y_pred = svc.predict(X_test)" gives you the predicted labels for the test set, and that "Y_pred" plays the role of "Y_test".
One possible reason for the size mismatch is that the number of columns (axis=1) in your train set is not equal to that of the test set.
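One way to check for such a column mismatch is to compare the column names of the two feature frames directly. This is only a sketch: the two tiny DataFrames below are hypothetical stand-ins for the CSV files, and the names only_in_train, only_in_test and columns_match are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frames standing in for the two CSV files.
train_df = pd.DataFrame(
    {"Pclass": [1, 3], "Age": [22.0, 38.0], "Survived": [0, 1]}
)
test_df = pd.DataFrame(
    {"PassengerId": [892, 893], "Pclass": [3, 2], "Age": [34.5, 47.0]}
)

X_train = train_df.drop("Survived", axis=1)
X_test = test_df.drop("PassengerId", axis=1)

# Columns that appear on one side only would cause a shape mismatch.
only_in_train = sorted(set(X_train.columns) - set(X_test.columns))
only_in_test = sorted(set(X_test.columns) - set(X_train.columns))
print("only in train:", only_in_train)
print("only in test:", only_in_test)

# Put the test columns in the same order as the training columns.
X_test = X_test[X_train.columns]
columns_match = list(X_train.columns) == list(X_test.columns)
print("columns match:", columns_match)
```

If either list is non-empty, the extra columns have to be dropped (or added) before calling predict().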
If you want to split into train/test subsets with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of these changes and preserve the original sizes of the train-test split is to keep the key-value pairs created by the train-test merge. You can do that with pandas.concat, using the "keys" parameter.

Incorporating the above, one simple solution might be:

# imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')

# merge train and test sets; the keys record which file each row came from
# (columns present in only one file are filled with NaN by concat)
merged_data = pd.concat([train_df, test_df], keys=[0, 1])

# data preprocessing can take place on the merged data
# here you could also do feature engineering etc.
# e.g. check null values across the whole dataset
print(merged_data[merged_data.isnull().any(axis=1)])

# now recover the train and test sets, using the keys from the train-test merge
train_part = merged_data.xs(0)
test_part = merged_data.xs(1)  # unlabeled rows, kept for the final predictions

# setting up predictors - target
X = train_part.loc[:, train_part.columns != "Survived"]
y = train_part.loc[:, "Survived"]

# train-test split
# If test_size and train_size are both None, test_size defaults to 0.25 (per the documentation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# score on the held-out split, not on the rows the model was fit on
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("svm accuracy is:", acc_svc)

In my opinion, once you understand the above, you could further estimate and compare your model's performance using the cross_val_score function, in the way @SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could:

from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
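As a rough, self-contained illustration of what that call returns, the sketch below runs it on synthetic data from make_classification (an assumption, used only because the CSV files are not available here) and summarizes the per-fold scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for X_train / y_train.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

svc = SVC()

# One accuracy value per fold; the model is refit on each training fold.
scores = cross_val_score(svc, X, y, cv=3, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

The mean gives a single estimate to compare models with, and the spread across folds hints at how stable that estimate is.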

If you do not want to go that far and would rather stay close to your starter code, then you should delete your 5th line of code, and I suppose it would run (provided your test set does not include your target variable; otherwise drop it). However, in this case you would not be able to split into train and test sets on your own, since the data is already split, so the title of your question should be changed.

Answer 1 | 0 | accepted | 2020-09-23 06:08:27

Answer 2 | 0 | 2020-09-23 08:53:13
