TensorFlow: How to build an efficient NLP pipeline using tf.data Dataset
I am working with TensorFlow and trying to create an efficient training and inference pipeline using the tf.data API, but I am running into an error.
For example, a simple RNN network looks like this:
import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size = 20
word_embedding_dim = 100
batch_size = 2
tf.reset_default_graph()
# placeholders
sentences = tf.placeholder(tf.int32, [None,None], name='sentences')
targets = tf.placeholder(tf.int32, [None, None], name='labels' )
keep_prob = tf.placeholder(tf.float32, [1,], name='dropout')
keep_prob = tf.cast(keep_prob.shape[0],tf.float32)
# embedding
word_embedding = tf.get_variable(name='word_embedding_',
                                 shape=[vocab_size, word_embedding_dim],
                                 dtype=tf.float32,
                                 initializer = tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)
# bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)
with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)
with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)
logits = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection = tf.layers.dense(logits, 5)
#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)
The dummy data is:
dummy_data = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummpy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]
This is how I typically train this network, slicing and padding the sequences manually:
# pad and slice
def get_train_data(batch_size, slice_no):
    batch_data_j = np.array(dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size])
    batch_labels = np.array(dummpy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])
    # getting Max length of sequence
    max_sequence = max(list(map(len, batch_data_j)))
    padded_sequence = [i + [0] * (max_sequence - len(i)) if len(i) < max_sequence else i for i in batch_data_j]
    return padded_sequence, batch_labels

# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size
    for iter_ in range(iteration):
        sentences_, labels_ = get_train_data(2,iter_)
        loss_,_ = sess.run([loss,optimizer], feed_dict= {sentences: sentences_, targets: labels_, keep_prob : 0.2})
        print(loss_)
Now I want to use the tf.data API to build an efficient pipeline for training and inference. I went through some tutorials but couldn't find a good answer.
I tried to use tf.data like this:
dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)
iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sentec, labels, drop_ = iterator.get_next()

def initialize_iterator(sess, sentences_, labels_, drops_):
    feed_dict = {sentences: sentences_, targets: labels_, keep_prob : [np.random.randint(0,2,[1,]).astype(np.float32)]}
    return sess.run(iterator_initializer_, feed_dict = feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size
    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)
But I am getting an error.
So what would be an efficient pipeline for training an RNN, including encoding, padding sequences, and applying dropout, using the Dataset API?
I suggest preparing your data in another format, for example CSV or TFRecord. Then you can use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data into a tf.data object directly.
There are tutorials on this topic here.
If you are using TFRecord, one example (in tensorflow.Example protocol buffer text format) would look like this:
features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: 0
        value: 55
        value: 128
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: 10001
        value: 10002
      }
    }
  }
}
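The same record could be built in Python rather than written by hand. This is a minimal sketch, assuming a TensorFlow version that exposes the tf.train.Example protocol buffer classes; the helper name int64_feature is hypothetical and only used for illustration.

```python
import tensorflow as tf


def int64_feature(values):
    """Wrap a list of ints in a tf.train.Feature (hypothetical helper)."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Build the record shown above: token ids under "sentences", label ids under "targets".
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "sentences": int64_feature([0, 55, 128]),
            "targets": int64_feature([10001, 10002]),
        }
    )
)

# Serialize to bytes; this is what gets written into a TFRecord file.
serialized = example.SerializeToString()

# Round-trip check: parse the bytes back into a tf.train.Example.
restored = tf.train.Example.FromString(serialized)
print(restored)
```

Printing the restored message shows the record in text format, which is a quick way to confirm that the feature keys and value types are what the parser will later expect.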
I would treat keep_prob and batch_size as model configuration parameters. You don't need to embed them in the examples.
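One way to keep these values out of the data is a small configuration object. The sketch below is an assumption about how that might look, not part of the original code: ModelConfig, dropout_rate, and keep_prob_for are hypothetical names, and the 0.2 training dropout comes from the comment in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for settings that belong to the model, not the data."""
    batch_size: int = 2
    dropout_rate: float = 0.2  # fraction of units dropped during training


def keep_prob_for(config, training):
    """Return the keep probability: 1 - dropout_rate when training, 1.0 at inference."""
    return 1.0 - config.dropout_rate if training else 1.0


config = ModelConfig()
print(keep_prob_for(config, training=True))
print(keep_prob_for(config, training=False))
```

With this split, the dataset only yields sentences and targets, and the training or inference loop decides which keep probability to pass to the model.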
Once you have created your training and evaluation examples in the TFRecord format above and serialized them, building the data pipeline is straightforward.
dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
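A fuller sketch of this step, covering parsing and per-batch padding, might look as follows. It assumes TensorFlow 2.x with eager execution (the question uses the 1.x graph API, where an iterator and a session would be needed instead). The temporary file, the helper names int64_feature and parse, and the fixed label length of 5 are assumptions for illustration. The dummy sentences are the ones from the question with trailing zeros stripped so that their lengths differ.

```python
import os
import tempfile

import tensorflow as tf

# Dummy data from the question, with trailing padding removed (assumption).
sentences = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12], [12, 4, 12], [1, 3, 4, 5, 5, 12]]
labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]


def int64_feature(values):
    """Wrap a list of ints in a tf.train.Feature (hypothetical helper)."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Write one serialized tf.train.Example per sentence to a temporary TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for sentence, target in zip(sentences, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "sentences": int64_feature(sentence),
            "targets": int64_feature(target),
        }))
        writer.write(example.SerializeToString())


def parse(serialized):
    """Decode one record: variable-length token ids, fixed-length label vector."""
    spec = {
        "sentences": tf.io.VarLenFeature(tf.int64),
        "targets": tf.io.FixedLenFeature([5], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return tf.sparse.to_dense(parsed["sentences"]), parsed["targets"]


# padded_batch pads each batch with zeros up to the longest sentence in that batch.
dataset = (
    tf.data.TFRecordDataset(filenames=[path])
    .map(parse)
    .padded_batch(2, padded_shapes=([None], [5]))
)

batches = [(x.numpy().tolist(), y.numpy().tolist()) for x, y in dataset]
for batch_sentences, batch_targets in batches:
    print(batch_sentences, batch_targets)
```

Padding inside the dataset replaces the manual slice-and-pad helper from the question, and the batch size stays a pipeline argument rather than a field stored in each record.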
Based on the dataset, you could build your input_fn, then continue with either the tf.estimator API or the Keras API. One example tutorial is here.