How to split data into test and train using TensorFlow

I'm very new to TensorFlow. I've attended an online course, but I still have many questions related to data pre-processing. I would really appreciate it if someone could help me out!

My goal is to train a model that classifies Portuguese nouns into two gender categories (feminine and masculine) based on their internal structure. So, for this, I have a file containing about 4300 nouns and their categories (F and M labels).

First question: I have opened the nouns file, tokenized the words, and then padded them. I have put the labels in a separate file. The labels file is a txt list containing the labels 'f' and 'm'. I've converted them into 0 and 1 integers and then converted them into a numpy array. I've also converted the padded nouns into a numpy array. Is that correct?

What is strange is that I have set the number of epochs to 100, but the program keeps training…

Second question:

How can I split my training data and labels into test and test_labels?

My code so far is below:

from pathlib import Path
import numpy as np
import tensorflow as tf

while True:
    try:
        file_to_open = Path(input("Please, insert your file path: "))
        with open(file_to_open, 'r', encoding="utf-8") as f:
            words = f.read()
        break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect directory path. Try again")

corpus = words.split('\n')

# Separate label lines ('0' or '1') from noun lines
labels = []
nouns = []
for line in corpus:
    if line in ('0', '1'):
        labels.append(line)
    else:
        nouns.append(line)

training_labels = np.array([int(x) for x in labels])

# Split each noun into space-separated characters so the
# tokenizer works at the character level
training_nouns = [' '.join(w) + ', ' for w in nouns]

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_nouns)
word_index = tokenizer.word_index
nouns_sequences = tokenizer.texts_to_sequences(training_nouns)
padded = pad_sequences(nouns_sequences, maxlen=max_length, truncating=trunc_type)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(36, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # single sigmoid unit: F vs. M
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


training_padded = np.array(padded)

num_epochs = 150
model.fit(training_padded, training_labels, epochs=num_epochs)

If you don't have to do the split with TensorFlow itself, you can use scikit-learn's train_test_split function like this (and then continue with TensorFlow):

from sklearn.model_selection import train_test_split

# Note the return order: train/test splits of the first array,
# then train/test splits of the second
train_data, test_data, train_labels, test_labels = train_test_split(YOUR_DATA, YOUR_LABELS)

See here for more information.
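Applied to the code above, a minimal sketch might look like this (assuming the training_padded and training_labels variables defined earlier; the test_size, random_state, and stratify values are choices, not requirements):

from sklearn.model_selection import train_test_split

train_data, test_data, train_labels, test_labels = train_test_split(
    training_padded,
    training_labels,
    test_size=0.2,             # hold out 20% of the nouns for testing
    random_state=42,           # fixed seed so the split is reproducible
    stratify=training_labels,  # keep the F/M ratio the same in both splits
)

model.fit(train_data, train_labels, epochs=num_epochs,
          validation_data=(test_data, test_labels))

Passing the held-out set as validation_data makes model.fit report validation loss and accuracy after every epoch, which also helps you see when the model stops improving.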
