
How to split this dataset into train and validation set?

I want to split my dataset with sklearn because I don't think validation_split is working for me. Here's how I'm actually reading the dataset:

input_sentences = []
output_sentences = []
output_sentences_inputs = []    # decoder inputs: translated data shifted right with <sos>

# NUM_SENTENCES must be defined beforehand; it caps how many lines are read
count = 0
for line in open(r'/content/drive/My Drive/TEMPPP/123.txt', encoding="utf-8"):
    count += 1

    if count > NUM_SENTENCES:
        break

    if '\t' not in line:    # skip lines without a source/target pair
        continue

    input_sentence, output = line.rstrip().split('\t')

    output_sentence = output + ' <eos>'
    output_sentence_input = '<sos> ' + output

    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

Now I'm confused about how to use scikit-learn. For now, this is what I did:

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(input_sentences, output_sentences, test_size = 0.2, random_state = 1)
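Note that the third list, output_sentences_inputs, also has to stay aligned with the other two. train_test_split accepts any number of same-length arrays in a single call and splits them all on the same random indices. A sketch using toy stand-in data (not the real file):

```python
from sklearn.model_selection import train_test_split

# Toy parallel lists standing in for the real sentence lists
input_sentences = ["hello", "goodbye", "thanks", "please", "yes"]
output_sentences = ["hola <eos>", "adios <eos>", "gracias <eos>",
                    "por favor <eos>", "si <eos>"]
output_sentences_inputs = ["<sos> hola", "<sos> adios", "<sos> gracias",
                           "<sos> por favor", "<sos> si"]

# One call keeps all three lists aligned on the same shuffled indices
(xTrain, xTest,
 yTrain, yTest,
 yTrainInput, yTestInput) = train_test_split(
    input_sentences, output_sentences, output_sentences_inputs,
    test_size=0.2, random_state=1)

print(len(xTrain), len(xTest))  # 80/20 split: 4 train, 1 test
```

Splitting each list separately (three calls) would shuffle them independently and break the source/target pairing, so the single call matters here.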

First, is this the right approach?

If not, how do I make the split?

If yes, then help me with this confusion: I was passing input_sentences and output_sentences to my layers, so what do I need to pass now? Do I still pass input_sentences and output_sentences as before and train the model on the full dataset, or do I only pass xTrain and yTrain? And are xTest and yTest never passed to the layers, only used for validation?

Based on your code, it seems that what you're currently doing is correct.

No, you can forget about input_sentences and output_sentences; from now on, only use the arrays that train_test_split creates.

If you're using an ML algorithm that has a fit() method, that method will use xTrain and yTrain. The predict() method will take xTest, and you use yTest to check the accuracy of the predictions from predict(), for example by calling sklearn.metrics.r2_score(yTest, predictions) (note the order: true values first, then predictions).
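To make that flow concrete, here is a minimal sketch using LinearRegression on synthetic numeric data. The model and data are illustrative stand-ins, not the asker's text model; only the fit/predict/score pattern carries over:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic numeric data standing in for real features and targets
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

xTrain, xTest, yTrain, yTest = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = LinearRegression()
model.fit(xTrain, yTrain)           # train on the training split only
predictions = model.predict(xTest)  # predict on the held-out inputs

score = r2_score(yTest, predictions)  # y_true first, then y_pred
print(score)
```

xTest and yTest never touch fit(); they only measure how the trained model does on data it has not seen.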

Also, please note that ending your sentences with multiple question marks makes them sound interrogatory and impolite. You're here asking for help, so please watch your punctuation.

