
How to split this dataset into train and validation set?

I want to split my dataset with sklearn because I don't think validation_split is working for me. Here's how I'm actually reading the dataset:

input_sentences = []
output_sentences = []
output_sentences_inputs = []    # decoder inputs: translated data shifted right with <sos>

# NUM_SENTENCES must be defined beforehand; it caps how many lines are read
count = 0
for line in open(r'/content/drive/My Drive/TEMPPP/123.txt', encoding="utf-8"):
    count += 1

    if count > NUM_SENTENCES:
        break

    if '\t' not in line:    # skip lines without a source/target pair
        continue

    input_sentence, output = line.rstrip().split('\t')

    output_sentence = output + ' <eos>'
    output_sentence_input = '<sos> ' + output

    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

Now I'm confused about how to use scikit-learn. For now, this is what I did:

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(input_sentences, output_sentences, test_size = 0.2, random_state = 1)
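Note that the third list, output_sentences_inputs, also has to stay aligned with the other two. train_test_split accepts any number of same-length arrays in a single call and splits them all on the same random indices. A sketch using toy stand-in data (not the real file):

```python
from sklearn.model_selection import train_test_split

# Toy parallel lists standing in for the real sentence lists
input_sentences = ["hello", "goodbye", "thanks", "please", "yes"]
output_sentences = ["hola <eos>", "adios <eos>", "gracias <eos>",
                    "por favor <eos>", "si <eos>"]
output_sentences_inputs = ["<sos> hola", "<sos> adios", "<sos> gracias",
                           "<sos> por favor", "<sos> si"]

# One call keeps all three lists aligned on the same shuffled indices
(xTrain, xTest,
 yTrain, yTest,
 yTrainInput, yTestInput) = train_test_split(
    input_sentences, output_sentences, output_sentences_inputs,
    test_size=0.2, random_state=1)

print(len(xTrain), len(xTest))  # 80/20 split: 4 train, 1 test
```

Splitting each list separately (three calls) would shuffle them independently and break the source/target pairing, so the single call matters here.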

First, is this the right approach?

If not, how do I make the split?

If yes, then help me with this confusion: I was passing input_sentences and output_sentences to my layers, so what do I need to pass now? Do I still pass input_sentences and output_sentences as before and train the model on the full dataset, or do I only pass xTrain and yTrain? And are xTest and yTest never passed to the layers, only used for validation?

Based on your code, it seems that what you're currently doing is correct.

No, you can forget about input_sentences and output_sentences; from now on, only use the arrays that train_test_split creates.

If you're using an ML algorithm that has a fit() method, that method will use xTrain and yTrain. The predict() method will take xTest, and you use yTest to check the accuracy of the predictions from predict(), for example by calling sklearn.metrics.r2_score(yTest, predictions) (note the order: true values first, then predictions).
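To make that flow concrete, here is a minimal sketch using LinearRegression on synthetic numeric data. The model and data are illustrative stand-ins, not the asker's text model; only the fit/predict/score pattern carries over:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic numeric data standing in for real features and targets
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

xTrain, xTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = LinearRegression()
model.fit(xTrain, yTrain)           # train on the training split only
predictions = model.predict(xTest)  # predict on the held-out inputs

score = r2_score(yTest, predictions)  # y_true first, then y_pred
print(score)
```

xTest and yTest never touch fit(); they only measure how the trained model does on data it has not seen.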

Also, please note that ending your sentences with multiple question marks makes them sound interrogatory and impolite. You're here asking for help, so please watch your punctuation.

