TensorFlow: How to build an efficient NLP pipeline using tf.data datasets

Question

I am working with TensorFlow and trying to build an efficient training and inference pipeline using the tf.data Dataset API, but I am running into an error.

For example, here is a simple RNN network:

import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size          = 20
word_embedding_dim  = 100
batch_size          = 2



tf.reset_default_graph()
# placeholders
sentences             = tf.placeholder(tf.int32, [None,None], name='sentences')
targets               = tf.placeholder(tf.int32, [None, None], name='labels' )
# scalar dropout rate (despite the name): fed as 0.2 for training and 0.0 for inference
keep_prob             = tf.placeholder(tf.float32, [], name='dropout')


# embedding
word_embedding         = tf.get_variable(name='word_embedding_',
                                             shape=[vocab_size, word_embedding_dim],
                                             dtype=tf.float32,
                                             initializer = tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)



#  bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)

logits             = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection  = tf.layers.dense(logits, 5)



#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)

And the dummy data is:

dummy_data    = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummpy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]

This is how I typically train this network, slicing and padding the sequences manually:

# pad and slice


def get_train_data(batch_size, slice_no):
    # slice out one batch; keep the sentences as Python lists so ragged batches can be padded
    batch_data_j = dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size]
    batch_labels = np.array(dummpy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])

    # length of the longest sequence in this batch
    max_sequence = max(map(len, batch_data_j))

    # right-pad every sequence with 0 up to max_sequence
    padded_sequence = [i + [0] * (max_sequence - len(i)) for i in batch_data_j]
    return padded_sequence, batch_labels




# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):

        sentences_, labels_ = get_train_data(batch_size, iter_)
        loss_,_ = sess.run([loss,optimizer], feed_dict= {sentences: sentences_, targets: labels_, keep_prob : 0.2})
        print(loss_)

Now I want to use the tf.data API to build an efficient pipeline for training and inference. I went through some tutorials but couldn't find a good answer.

I tried to use tf.data like this:

dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)

iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sentec, labels, drop_  = iterator.get_next()



def initialize_iterator(sess, sentences_, labels_, drops_):

        feed_dict = {sentences: sentences_, targets: labels_, keep_prob : [np.random.randint(0,2,[1,]).astype(np.float32)]}

        return sess.run(iterator_initializer_, feed_dict = feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)

But I am getting an error.

So what would be an efficient pipeline for training an RNN, including encoding, sequence padding, and dropout, using the Dataset API?

Answer 1

I suggest preparing your data in another format, for example CSV or TFRecord. You can then use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data directly into a tf.data object.

There are tutorials on this topic here.

If you are using TFRecord, one example (in tensorflow.Example protocol buffer text format) would look like this:

features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: 0
        value: 55
        value: 128
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: 10001
        value: 10002
      }
    }
  }
}
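
As a rough sketch of how records like the one above could be produced, the snippet below packs each sentence and its label vector into a tf.train.Example and writes the serialized protos to a TFRecord file. The helper name make_example, the output path, and the reuse of the question's dummy data are assumptions made for illustration; the feature keys follow the example above.

```python
import os
import tempfile

import tensorflow as tf


def make_example(sentence, target):
    """Pack one (sentence, target) pair into a tf.train.Example (hypothetical helper)."""
    feature = {
        "sentences": tf.train.Feature(int64_list=tf.train.Int64List(value=sentence)),
        "targets": tf.train.Feature(int64_list=tf.train.Int64List(value=target)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Dummy data from the question, standing in for a real corpus.
dummy_data = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12, 0], [12, 4, 12, 0, 0, 0], [1, 3, 4, 5, 5, 12]]
dummy_labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]

# Assumed output location; any writable path works.
record_path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")

# Serialize every example and append it to the TFRecord file.
with tf.io.TFRecordWriter(record_path) as writer:
    for sentence, target in zip(dummy_data, dummy_labels):
        writer.write(make_example(sentence, target).SerializeToString())
```

Calling print(make_example([0, 55, 128], [10001, 10002])) shows the proto in the same text format as above.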

I would treat keep_prob and batch_size as model configuration parameters. You don't need to embed them in the examples.
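
One way to read this suggestion, sketched for in-memory data: the dataset yields only sentences and targets, while the batch size and dropout rate sit in a separate configuration. The config dictionary, the make_dataset helper, and the variable-length variant of the question's dummy data are assumptions for illustration. The sketch also assumes TensorFlow 2.x eager execution if you iterate over the dataset directly.

```python
import tensorflow as tf

# Hypothetical configuration: these values live outside the data.
config = {"batch_size": 2, "dropout_rate": 0.2, "num_labels": 5}

# Variable-length variant of the question's dummy data (trailing zeros removed).
sentences = [[1, 3, 4, 5, 5, 12], [1, 3, 4, 4, 12], [12, 4, 12], [1, 3, 4, 5, 5, 12]]
labels = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]


def make_dataset(sentences, labels, batch_size, num_labels):
    """Build a dataset of (sentence, label) pairs, padded per batch."""

    def generator():
        for sentence, label in zip(sentences, labels):
            yield sentence, label

    dataset = tf.data.Dataset.from_generator(
        generator,
        output_types=(tf.int32, tf.int32),
        output_shapes=([None], [num_labels]),
    )
    # Pad each sentence with 0 up to the longest sentence in its batch.
    return dataset.padded_batch(batch_size, padded_shapes=([None], [num_labels]))


dataset = make_dataset(sentences, labels, config["batch_size"], config["num_labels"])
```

The dropout rate is then passed to the model separately, for instance as a function argument, or in graph-mode code like the question's as a scalar created with tf.placeholder_with_default(0.0, shape=[]) so that inference needs no feed.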

Once you have created your training and evaluation examples in the TFRecord format above and serialized them, building the data pipeline is straightforward.

dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
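
A TFRecordDataset yields serialized strings, so the records still need to be parsed and padded. The sketch below shows one possible way to do that. It assumes the feature keys from the example above, a fixed label length of 5 as in the question's dummy labels, and hypothetical helper names parse_example and build_dataset.

```python
import tensorflow as tf

# Assumed schema: variable-length sentence, fixed-length (5) multi-hot target.
feature_spec = {
    "sentences": tf.io.VarLenFeature(tf.int64),
    "targets": tf.io.FixedLenFeature([5], tf.int64),
}


def parse_example(serialized):
    """Decode one serialized tf.train.Example into (sentence, target) tensors."""
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    sentence = tf.sparse.to_dense(parsed["sentences"])  # shape [sentence_length]
    return tf.cast(sentence, tf.int32), tf.cast(parsed["targets"], tf.int32)


def build_dataset(filenames, batch_size, training):
    """Read TFRecord files and return batches of padded sentences and targets."""
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parse_example)
    if training:
        # Shuffle only for training; the buffer size is an arbitrary choice.
        dataset = dataset.shuffle(buffer_size=1000)
    # Pad each sentence with 0 up to the longest sentence in its batch.
    dataset = dataset.padded_batch(batch_size, padded_shapes=([None], [5]))
    return dataset.prefetch(1)
```

With TensorFlow 2.x eager execution the result can be consumed with a plain for loop over build_dataset(...). In 1.x graph mode, as used in the question, the batch tensors would instead come from tf.compat.v1.data.make_one_shot_iterator(dataset).get_next() and replace the sentences and targets placeholders.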

Based on the dataset, you can build your input_fn and then continue with either the tf.estimator API or the Keras API. One example tutorial is here.

Solution 1
0 2019-12-05 01:05:29
