
Tensorflow: How to build an efficient NLP pipeline using tf Dataset

I am working with TensorFlow and trying to create an efficient training and inference pipeline using the tf.data API, but I am facing some errors.

For example, a simple RNN network structure looks like this:

import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size          = 20
word_embedding_dim  = 100
batch_size          = 2



tf.reset_default_graph()
# placeholders
sentences             = tf.placeholder(tf.int32, [None, None], name='sentences')
targets               = tf.placeholder(tf.int32, [None, None], name='labels')
# scalar dropout probability, fed at run time (e.g. 0.2 for training, 0.0 for inference)
keep_prob             = tf.placeholder(tf.float32, [], name='dropout')


# embedding
word_embedding         = tf.get_variable(name='word_embedding_',
                                             shape=[vocab_size, word_embedding_dim],
                                             dtype=tf.float32,
                                             initializer = tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)



#  bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)

logits             = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection  = tf.layers.dense(logits, 5)



#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)

And the dummy data is:

dummy_data   = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]

Here is how I typically train this network, slicing and padding the sequences manually:

#  pad and slice 


def get_train_data(batch_size, slice_no):

    batch_data_j = np.array(dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size])
    batch_labels = np.array(dummy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])

    # get the max sequence length in this batch
    max_sequence = max(list(map(len, batch_data_j)))

    # pad every shorter sequence with zeros up to max_sequence
    padded_sequence = [i + [0] * (max_sequence - len(i)) if len(i) < max_sequence else i for i in batch_data_j]
    return padded_sequence, batch_labels




# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):

        sentences_, labels_ = get_train_data(batch_size, iter_)
        loss_, _ = sess.run([loss, optimizer],
                            feed_dict={sentences: sentences_, targets: labels_, keep_prob: 0.2})
        print(loss_)

Now I want to use the tf.data pipeline to build an efficient pipeline for training and inference. I went through some tutorials but couldn't find a good answer.

I tried to use tf.data like this:

dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)

iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sents, labels, drop_  = iterator.get_next()



def initialize_iterator(sess, sentences_, labels_, drops_):

    feed_dict = {sentences: sentences_, targets: labels_, keep_prob: [np.random.randint(0, 2, [1,]).astype(np.float32)]}

    return sess.run(iterator_initializer_, feed_dict=feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)

But I am getting an error.

So what would an efficient pipeline look like for training an RNN, with encoding, padding, and dropout, using the Dataset API?

I suggest you prepare your data in another format, for example CSV or TFRecord. Then you can use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data directly into a tf.data object.
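For the CSV route, a minimal sketch might look like this (the file name train.csv and the column name target are my assumptions for illustration; note that variable-length sentences are easier to store in TFRecord than in CSV):

# a minimal sketch, assuming a headered file 'train.csv' with a 'target' column
dataset = tf.data.experimental.make_csv_dataset(
    'train.csv',
    batch_size=batch_size,
    label_name='target',
    num_epochs=1)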

There are tutorials on this topic here.

If you are using TFRecord, one example (in tensorflow.Example protocol buffer text format) would look like this:

features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: 0
        value: 55
        value: 128
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: 10001
        value: 10002
      }
    }
  }
}
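As a sketch of how such examples could be created (this part is my addition, not spelled out in the original answer), the helper below serializes the dummy data above into a file; the helper name to_example and the file name train.tfrecord are assumptions:

def to_example(sentence_ids, target_ids):
    # hypothetical helper: wrap one (sentence, targets) pair in a tf.train.Example
    return tf.train.Example(features=tf.train.Features(feature={
        'sentences': tf.train.Feature(int64_list=tf.train.Int64List(value=sentence_ids)),
        'targets':   tf.train.Feature(int64_list=tf.train.Int64List(value=target_ids)),
    }))

# 'train.tfrecord' is an assumed file name
with tf.python_io.TFRecordWriter('train.tfrecord') as writer:
    for sent, lab in zip(dummy_data, dummy_labels):
        writer.write(to_example(sent, lab).SerializeToString())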

I would use keep_prob and batch_size as model configuration parameters; you don't need to embed them in the examples.

Once you have created and serialized your training and evaluation examples in the TFRecord format above, building your data pipeline is straightforward.

dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
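Since the sentences are variable-length, one way to finish the pipeline (a sketch under my own assumptions, not part of the original answer) is to parse each record with a VarLenFeature and let padded_batch do the zero-padding that get_train_data did by hand:

def parse_fn(serialized):
    # parse one serialized tf.train.Example from the TFRecord file above
    parsed = tf.parse_single_example(serialized, features={
        'sentences': tf.VarLenFeature(tf.int64),       # variable-length word ids
        'targets':   tf.FixedLenFeature([5], tf.int64),
    })
    sentence = tf.sparse.to_dense(parsed['sentences'])
    return tf.cast(sentence, tf.int32), tf.cast(parsed['targets'], tf.int32)

dataset = (tf.data.TFRecordDataset(['train.tfrecord'])
           .map(parse_fn)
           .shuffle(buffer_size=100)
           # pad each sentence to the longest one in its batch
           .padded_batch(batch_size, padded_shapes=([None], [5])))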

Based on the dataset, you can build your input_fn, and then continue with either the tf.estimator API or the Keras API. One example tutorial is here.
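For example, a minimal input_fn built on the pieces above might look like this (the names are mine, for illustration):

def train_input_fn():
    # returns a (features, labels) pair, as tf.estimator expects
    dataset = (tf.data.TFRecordDataset(['train.tfrecord'])
               .map(parse_fn)
               .padded_batch(batch_size, padded_shapes=([None], [5])))
    sents, labels = dataset.make_one_shot_iterator().get_next()
    return {'sentences': sents}, labels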
