
Tensorflow : How to build efficient NLP pipeline using tf Dataset

I am working with TensorFlow and trying to build an efficient training and inference pipeline using the tf.data API, but I am running into errors.

For example, a simple RNN network looks like this:

import tensorflow as tf
import numpy as np
# hyper parameters
vocab_size          = 20
word_embedding_dim  = 100
batch_size          = 2



tf.reset_default_graph()
# placeholders
sentences             = tf.placeholder(tf.int32, [None,None], name='sentences')
targets               = tf.placeholder(tf.int32, [None, None], name='labels' )
keep_prob             = tf.placeholder(tf.float32, shape=[], name='dropout')  # scalar drop probability


# embedding
word_embedding         = tf.get_variable(name='word_embedding_',
                                             shape=[vocab_size, word_embedding_dim],
                                             dtype=tf.float32,
                                             initializer = tf.contrib.layers.xavier_initializer())
embedding_lookup = tf.nn.embedding_lookup(word_embedding, sentences)



#  bilstm model
with tf.variable_scope('forward'):
    fr_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_fr = tf.contrib.rnn.DropoutWrapper(fr_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('backward'):
    bw_cell = tf.contrib.rnn.LSTMCell(num_units = 15)
    dropout_bw = tf.contrib.rnn.DropoutWrapper(bw_cell, output_keep_prob = 1. - keep_prob)

with tf.variable_scope('bi-lstm') as scope:
    model,last_state = tf.nn.bidirectional_dynamic_rnn(dropout_fr,
                                                       dropout_bw,
                                                       inputs=embedding_lookup,
                                                       dtype=tf.float32)

logits             = tf.transpose(tf.concat(model, 2), [1, 0, 2])[-1]
linear_projection  = tf.layers.dense(logits, 5)



#loss
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = linear_projection, labels = tf.cast(targets,tf.float32))
loss = tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=1))
optimizer = tf.train.AdamOptimizer(learning_rate = 0.001).minimize(loss)

And some dummy data:

dummy_data    = [[1,3,4,5,5,12],[1,3,4,4,12,0],[12,4,12,0,0,0],[1,3,4,5,5,12]]
dummpy_labels = [[1,0,0,0,0],[0,1,0,1,0],[1,0,0,0,0],[0,1,0,1,0]]

This is how I normally train the network, manually slicing and padding each batch:

#  pad and slice 


def get_train_data(batch_size, slice_no):

    batch_data_j = np.array(dummy_data[slice_no * batch_size:(slice_no + 1) * batch_size])
    batch_labels = np.array(dummpy_labels[slice_no * batch_size:(slice_no + 1) * batch_size])

    # max sequence length in this batch
    max_sequence = max(list(map(len, batch_data_j)))

    # pad shorter sequences with zeros up to max_sequence
    padded_sequence = [i + [0] * (max_sequence - len(i)) if len(i) < max_sequence else i for i in batch_data_j]
    return padded_sequence, batch_labels




# dropout 0.2 during training and 0.0 during inference
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):

        sentences_, labels_ = get_train_data(2,iter_)
        loss_,_ = sess.run([loss,optimizer], feed_dict= {sentences: sentences_, targets: labels_, keep_prob : 0.2})
        print(loss_)

Now I want to build an efficient training and inference pipeline with the tf.data API. I went through some tutorials but could not find a good answer.

I tried using tf.data like this:

dataset = tf.data.Dataset.from_tensor_slices((sentences,targets,keep_prob))
dataset = dataset.batch(batch_size)

iterator = tf.data.Iterator.from_structure(dataset.output_types)
iterator_initializer_ = iterator.make_initializer(dataset, name='initializer')
sentec, labels, drop_  = iterator.get_next()



def initialize_iterator(sess, sentences_, labels_, drops_):
    feed_dict = {sentences: sentences_, targets: labels_, keep_prob: [np.random.randint(0, 2, [1,]).astype(np.float32)]}
    return sess.run(iterator_initializer_, feed_dict=feed_dict)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = len(dummy_data) // batch_size

    for iter_ in range(iteration):
        initialize_iterator(sess, dummy_data, dummpy_labels, [0.0])
        los, _ = sess.run([loss, optimizer])
        print(los)

But I am getting an error.

So what is an efficient pipeline for training an RNN with the Dataset API, including encoding, padding the sequences, and handling dropout?

I suggest preparing your data in another format, such as CSV or TFRecord. Then you can use tf.data.experimental.make_csv_dataset or tf.data.TFRecordDataset to read the data directly into a tf.data object.

There is a tutorial on this topic here.

If you use TFRecord, one example (in tensorflow.Example proto text format) would look like this:

features {
  feature {
    key: "sentences"
    value {
      int64_list {
        value: 0
        value: 55
        value: 128
      }
    }
  }
  feature {
    key: "targets"
    value {
      int64_list {
        value: 10001
        value: 10002
      }
    }
  }
}
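For reference, building and serializing this Example from Python might look like the following sketch; the filename `train.tfrecord` is a placeholder, and the values are simply the ones from the text-format snippet above:

```python
import tensorflow as tf

# Build one tf.train.Example mirroring the text-format proto above.
example = tf.train.Example(features=tf.train.Features(feature={
    "sentences": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[0, 55, 128])),
    "targets": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[10001, 10002])),
}))

# Serialize it and write it to a TFRecord file (placeholder name).
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())
```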

I would treat keep_prob and batch_size as model configuration parameters; you do not need to embed them in the examples.

Once you have created and serialized training and evaluation examples in the TFRecord format above, building your data pipeline is straightforward:

dataset = tf.data.TFRecordDataset(filenames = [your_tf_record_file])
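From there, parsing the serialized examples and batching with padding (replacing the manual padding done in get_train_data) could be sketched as follows; the feature spec, the filename, and the batch size of 2 are assumptions taken from the snippets above:

```python
import tensorflow as tf

# Both features are variable-length int64 lists, so use VarLenFeature.
feature_spec = {
    "sentences": tf.io.VarLenFeature(tf.int64),
    "targets": tf.io.VarLenFeature(tf.int64),
}

def parse(serialized):
    # Parse one serialized Example and densify the sparse results.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return (tf.sparse.to_dense(parsed["sentences"]),
            tf.sparse.to_dense(parsed["targets"]))

dataset = (tf.data.TFRecordDataset(["your_tf_record_file"])
           .map(parse)
           # padded_batch zero-pads every sequence in a batch
           # to the length of the longest sequence in that batch
           .padded_batch(2, padded_shapes=([None], [None]))
           .prefetch(tf.data.AUTOTUNE))
```

With this, the padding happens inside the pipeline per batch, so the model never sees more padding than the current batch requires.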

From the dataset you can build an input_fn and then proceed with the tf.estimator API. An example tutorial is here.
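An input_fn for tf.estimator must return a dataset of (features, labels) pairs, with features as a dict. A minimal sketch under the same assumptions (feature spec, filename, and batch size are illustrative, not from the original answer):

```python
import tensorflow as tf

def train_input_fn():
    """Return a (features, labels) dataset for Estimator.train."""
    feature_spec = {
        "sentences": tf.io.VarLenFeature(tf.int64),
        "targets": tf.io.VarLenFeature(tf.int64),
    }

    def parse(serialized):
        parsed = tf.io.parse_single_example(serialized, feature_spec)
        # Estimators expect a features dict and a labels tensor.
        features = {"sentences": tf.sparse.to_dense(parsed["sentences"])}
        labels = tf.sparse.to_dense(parsed["targets"])
        return features, labels

    return (tf.data.TFRecordDataset(["your_tf_record_file"])
            .map(parse)
            .padded_batch(2, padded_shapes=({"sentences": [None]}, [None])))

# With a model_fn-backed Estimator this would then be:
# estimator.train(input_fn=train_input_fn, steps=...)
```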
