I'm very new to tensorflow. I've attended an online course, but I still have many questions related to data pre-processing. I would really appreciate if someone could help me out!!
My goal is to train a model that classifies Portuguese nouns into two gender categories (feminine and masculine) based on their internal structure. So, for this, I have a file containing about 4300 nouns and their categories (F and M labels).
First question: I have opened the nouns files and I first tokenized the words and after that I have padded them. I have a put the labels in a separated file. The labels file is a txt list containing the labels 'f' and 'm'. I've converted them into 0 and 1 integers and then convert them into a numpy array. I've also converted the padded nouns into a numpy array. Is that correct?
What is strange is that I have set the number of epochs for 100, but the program keeps training…
Second question:
How can I separate my train and labels into test and test_labels?
My code so far is below:
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize,wordpunct_tokenize
import re
import os
import sys
from pathlib import Path
import numpy as np
import numpy
import tensorflow as tf
while True:
try:
file_to_open =Path(input("Please, insert your file path: "))
with open(file_to_open,'r', encoding="utf-8") as f:
words = f.read()
break
except FileNotFoundError:
print("\nFile not found. Better try again")
except IsADirectoryError:
print("\nIncorrect Directory path.Try again")
corpus=words.split('\n')
labels = []
new_labels=[]
nouns = []
for i in corpus:
if i == '0':
labels.append(i)
elif i== '1':
labels.append(i)
else:
nouns.append(i)
for x in labels:
new_labels.append(int(x))
training_labels= numpy.array(new_labels)
training_nouns=[]
for w in nouns:
a=list(w)
b=' '.join([str(elem) for elem in a]) + ',' + ' '
training_nouns.append(b)
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_nouns)
word_index = tokenizer.word_index
nouns_sequences = tokenizer.texts_to_sequences(training_nouns)
padded = pad_sequences(nouns_sequences,maxlen=max_length)
model = tf.keras.Sequential([
tf.keras.layers.Embedding(vocab_size, embedding_dim,
input_length=max_length),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(36, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=.
['accuracy'])
model.summary()
training_padded = np.array(padded)
num_epochs = 150
model.fit(training_padded, training_labels, epochs=num_epochs)
If you shouldn't use Tensorflow. you can use train_test_split
scikit-learn
function like this(you can continue with tensorflow):
from sklearn.model_selection import train_test_split
train_data,train_labels,test_data,test_labels=train_test_split(YOUR DATA,YOUR LABELS)
see here for more information.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.