How to preprocess the TensorFlow imdb_reviews dataset
I am using the TensorFlow imdb_reviews dataset, and I want to preprocess it using Tokenizer and pad_sequences.
When I use a Tokenizer instance with the following code:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(df['text'])
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(df['text'])
print(word_index)
print(sequences)
I get the error TypeError: a bytes-like object is required, not 'dict'.
What I've tried
I stored the dataset as a DataFrame, iterated over the text column to collect the sentences into a list, and then tokenized it:
df = tfds.as_dataframe(ds.take(4), info)

# list to store corpus
corpus = []
for sentences in df['text'].iteritems():
    corpus.append(sentences)

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index
print(word_index)
But I get the error AttributeError: 'tuple' object has no attribute 'lower'.
How can I use the 'text' column and preprocess it to feed it to my neural network?
All you need is to convert the ['text'] column into a NumPy array first, followed by the necessary tokenization and padding. Below is the full working code. Enjoy.
Dataset
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# get the data first
imdb = tfds.load('imdb_reviews', as_supervised=True)
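Note that as_supervised=True makes each dataset element a (text, label) tuple instead of a feature dictionary, which is what lets the loop below unpack sentence, label directly. A quick sanity check:
# each element is a (text, label) tuple because of as_supervised=True
for sentence, label in imdb['train'].take(1):
    print(sentence.numpy()[:50], label.numpy())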
Data Preparation
# we will only take train_data (for demonstration purposes)
# do the same for test_data in your case
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
for sentence, label in train_data:
    training_sentences.append(str(sentence.numpy()))
    training_labels.append(str(label.numpy()))

# np.float was removed in NumPy 1.24; use np.float32 instead
training_labels_final = np.array(training_labels).astype(np.float32)

print(training_sentences[0])     # first sample
print(training_labels_final[0])  # first label
# b"This was an absolutely terrible movie. ...."
# 0.0
Preprocess - Tokenizer + Padding
vocab_size = 2000 # The maximum number of words to keep, based on word frequency.
embed_size = 30 # Dimension of the dense embedding.
max_len = 100 # Length of input sequences, when it is constant.
# https://keras.io/api/preprocessing/text/
tokenizer = Tokenizer(num_words=vocab_size,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                      lower=True,
                      split=" ",
                      oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)
print(tokenizer.word_index)
# {'<OOV>': 1, 'the': 2, 'and': 3, 'a': 4, 'of': 5, 'to': 6, 'is': 7, ...
# tokenized and padding
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_len, truncating='post')
print(training_sentences[0])
print()
print(training_padded[0])
# b"This was an absolutely terrible movie. ...."
#
# [  59   12   14   35  439  400   18  174   29    1    9   33 1378    1
#    42  496    1  197   25   88  156   19   12  211  340   29   70  248
#   213    9  486   62   70   88  116   99   24    1   12    1  657  777
#    12   18    7   35  406    1  178    1  426    2   92 1253  140   72
#   149   55    2    1    1   72  229   70    1   16    1    1    1    1
#  1506    1    3   40    1  119 1608   17    1   14  163   19    4 1253
#   927    1    9    4   18   13   14    1    5  102  148 1237   11  240
#   692   13]
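For intuition on what pad_sequences is doing here: padding defaults to 'pre' (zeros in front of short sequences), while truncating='post' chops the tail off long ones. A toy example, independent of the IMDB data:
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad_sequences(seqs, maxlen=4, truncating='post'))
# [[0 1 2 3]   <- short sequence padded in front (default padding='pre')
#  [4 5 6 7]]  <- long sequence truncated at the end (truncating='post')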
Model
A sample model.
# Input for variable-length sequences of integers
inputs = tf.keras.Input(shape=(None,), dtype="int32")

# Embed each integer
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=max_len)(inputs)

# Add 2 bidirectional LSTMs
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(x)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

# Add a classifier
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

# Compile and run
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(training_padded,
          training_labels_final,
          epochs=10,
          verbose=1)
Epoch 1/10
782/782 [==============================] - 25s 18ms/step - loss: 0.5548 - accuracy: 0.6915
Epoch 2/10
782/782 [==============================] - 14s 18ms/step - loss: 0.3921 - accuracy: 0.8248
...
782/782 [==============================] - 14s 18ms/step - loss: 0.2171 - accuracy: 0.9121
Epoch 9/10
782/782 [==============================] - 14s 17ms/step - loss: 0.1807 - accuracy: 0.9275
Epoch 10/10
782/782 [==============================] - 14s 18ms/step - loss: 0.1486 - accuracy: 0.9428
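At inference time, new text must go through the same tokenizer and padding before model.predict. A minimal sketch (the sample review is made up):
# hypothetical review, just for illustration
sample = ["This movie was absolutely wonderful, I loved every minute of it"]

sample_seq = tokenizer.texts_to_sequences(sample)
sample_padded = pad_sequences(sample_seq, maxlen=max_len, truncating='post')

# sigmoid output: values near 1.0 suggest a positive review
print(model.predict(sample_padded))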
You can convert the df['text'] column to a NumPy array by calling the to_numpy() method. See the docs here. Also, consider the docs for Tokenizer.fit_on_texts here.
corpus = df['text'].to_numpy()
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)
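One caveat: tfds.as_dataframe usually stores the text column as raw bytes objects, and Tokenizer internally applies a str translation table that bytes.translate rejects, which is most likely where the original TypeError: a bytes-like object is required, not 'dict' comes from. If so, decoding to str first should resolve it (a sketch, assuming the column holds bytes):
# decode bytes to str before tokenizing (tfds text columns are usually bytes)
corpus = [t.decode('utf-8') if isinstance(t, bytes) else t
          for t in df['text'].to_numpy()]
tokenizer.fit_on_texts(corpus)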
The Tokenizer.fit_on_texts method calls text_elem.lower() internally. But since you're not providing it with a list of strings, you're getting an exception. Here's a snippet from the source.
...
for text in texts:
    self.document_count += 1
    if self.char_level or isinstance(text, list):
        if self.lower:
            if isinstance(text, list):
                text = [text_elem.lower() for text_elem in text]
            else:
                ...
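Incidentally, this also explains the AttributeError in the question: Series.iteritems() yields (index, value) tuples, so each corpus element is a tuple rather than a string. Appending only the value fixes it (note that iteritems() is deprecated in recent pandas in favor of items()):
corpus = []
for _, sentence in df['text'].items():  # items() yields (index, value) pairs
    corpus.append(sentence)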