I am working with TensorFlow and trying to build an efficient training and inference pipeline with the tf.data API, but I am running into an error.
For example, a simple RNN network looks like this:
import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size = 20
word_embedding_dim = 100
batch_size = 2
tf.reset_default_graph()
# placeholders
sentences = tf.placeholder(tf.int32, [None,None], name='sentences')
targets = tf.placeholder(tf.int32, [None, None], name='labels' )
keep_prob = tf.placeholder(tf.float32, [1,], name='dropout')
keep_prob = tf.cast(keep_prob.shape[0],tf.float32)
# embedding
word_embedding = tf.get_variable(name='word_embedding_',
                                 shape=[vocab_size, word_embedding_dim],
                                 dtype=tf.float32,
                                 initializer=tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)
# bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)
with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)
with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)
logits = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection = tf.layers.dense(logits, 5)
#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)
The dummy data is:
dummy_data = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummpy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]
This is how I typically train this network, slicing the batches and padding the sequences manually:
# pad and slice
def get_train_data(batch_size, slice_no):
    batch_data_j = np.array(dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size])
    batch_labels = np.array(dummpy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])
    # getting max length of sequence
    max_sequence = max(list(map(len, batch_data_j)))
    padded_sequence = [i + [0] * (max_sequence - len(i)) if len(i) < max_sequence else i for i in batch_data_j]
    return padded_sequence, batch_labels

# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size
    for iter_ in range(iteration):
        sentences_, labels_ = get_train_data(2,iter_)
        loss_,_ = sess.run([loss,optimizer], feed_dict= {sentences: sentences_, targets: labels_, keep_prob : 0.2})
        print(loss_)
Now I want to use the tf.data Dataset API to build an efficient pipeline for training and inference. I went through some tutorials but couldn't find a good answer.
I tried to use tf.data like this:
dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)
iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sentec, labels, drop_ = iterator.get_next()
def initialize_iterator(sess, sentences_, labels_, drops_):
    feed_dict = {sentences: sentences_, targets: labels_, keep_prob : [np.random.randint(0,2,[1,]).astype(np.float32)]}
    return sess.run(iterator_initializer_, feed_dict = feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size
    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)
But I am getting an error.
So what would be an efficient pipeline for training an RNN, including encoding, padding, and dropout, using the Dataset API?
I suggest preparing your data in another format, for example CSV or TFRecord. You can then use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data directly into a tf.data object.
There are tutorials on this topic here.
If you are using TFRecord, one example (in tensorflow.Example protocol buffer text format) would look like this:
features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: 0
        value: 55
        value: 128
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: 10001
        value: 10002
      }
    }
  }
}
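As a rough sketch of how such examples might be produced from the in-memory lists in the question, the snippet below serializes each sentence and label vector into a tf.train.Example and writes it to a TFRecord file. The helper name make_example, the output path, and the choice to drop the trailing zero padding from the question's data are assumptions made for illustration. It also assumes a TensorFlow version that exposes tf.io.TFRecordWriter.

```python
import os
import tempfile

import tensorflow as tf


def make_example(sentence, target):
    """Build a tf.train.Example holding two int64 list features (hypothetical helper)."""
    return tf.train.Example(features=tf.train.Features(feature={
        "sentences": tf.train.Feature(int64_list=tf.train.Int64List(value=sentence)),
        "targets": tf.train.Feature(int64_list=tf.train.Int64List(value=target)),
    }))


# Data from the question, with trailing 0 padding removed (assumption: 0 is the pad id).
dummy_data = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12], [12, 4, 12], [1, 3, 4, 5, 5, 12]]
dummy_labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]

# Hypothetical output location; any writable path would do.
record_path = os.path.join(tempfile.gettempdir(), "train.tfrecord")

with tf.io.TFRecordWriter(record_path) as writer:
    for sentence, target in zip(dummy_data, dummy_labels):
        writer.write(make_example(sentence, target).SerializeToString())
```

Storing the sentences unpadded keeps the file small and leaves padding to the input pipeline, where it can be done per batch.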
I would use keep_prob and batch_size as model configuration parameters. You don't need to embed them in the examples.
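One way to read this advice is sketched below in plain Python: the dropout setting lives in a small configuration object and the keep probability is chosen by mode, rather than being sliced into the dataset alongside the sentences. The names ModelConfig and output_keep_prob are hypothetical, and the 0.2 dropout rate is taken from the comment in the question.

```python
from collections import namedtuple

# Hypothetical container for hyperparameters that are not part of the data.
ModelConfig = namedtuple("ModelConfig", ["batch_size", "dropout_rate"])


def output_keep_prob(config, training):
    """Return 1 - dropout_rate while training and 1.0 (no dropout) otherwise."""
    return 1.0 - config.dropout_rate if training else 1.0


config = ModelConfig(batch_size=2, dropout_rate=0.2)
train_keep = output_keep_prob(config, training=True)
eval_keep = output_keep_prob(config, training=False)
```

The resulting value could then be passed to something like DropoutWrapper's output_keep_prob argument when the graph is built for a given mode.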
Once your training and evaluation examples have been created in the TFRecord format above and serialized, building the data pipeline is straightforward.
dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
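A minimal sketch of what the rest of the pipeline might look like follows, assuming the feature keys "sentences" and "targets" from the example above, a fixed label width of 5 as in the question, and a TensorFlow version that provides the tf.io and tf.sparse namespaces. The function names parse_example and build_dataset and the shuffle buffer size are hypothetical. Padding is handled per batch by padded_batch, which pads each sentence with zeros up to the longest one in that batch.

```python
import tensorflow as tf

NUM_LABELS = 5  # assumption: width of the label vector in the question


def parse_example(serialized):
    """Decode one serialized tf.train.Example into (sentence, targets) tensors."""
    features = {
        "sentences": tf.io.VarLenFeature(tf.int64),               # variable length
        "targets": tf.io.FixedLenFeature([NUM_LABELS], tf.int64),  # fixed length
    }
    parsed = tf.io.parse_single_example(serialized, features)
    sentence = tf.cast(tf.sparse.to_dense(parsed["sentences"]), tf.int32)
    targets = tf.cast(parsed["targets"], tf.int32)
    return sentence, targets


def build_dataset(filenames, batch_size, training):
    """Read TFRecord files, parse them, and emit zero-padded batches."""
    dataset = tf.data.TFRecordDataset(filenames).map(parse_example)
    if training:
        dataset = dataset.shuffle(buffer_size=1000)  # hypothetical buffer size
    # Pad sentences to the longest one in each batch; targets are already fixed width.
    return dataset.padded_batch(batch_size, padded_shapes=([None], [NUM_LABELS]))
```

In graph mode the batches would be pulled through an iterator's get_next() and used in place of the sentences and targets placeholders. In eager mode the dataset can be iterated directly.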
Based on the dataset, you can build your input_fn and then continue with either the tf.estimator API or the Keras API. One example tutorial is here.