tensorflow loss is nan while training an RNN
Running an RNN with a single GRU cell, I hit a situation that produces the following stack trace:
    Traceback (most recent call last):
      File "language_model_test.py", line 15, in <module>
        test_model()
      File "language_model_test.py", line 12, in test_model
        model.train(random_data, s)
      File "/home/language_model/language_model.py", line 120, in train
        train_pp = self._run_epoch(data, sess, inputs, rnn_ouputs, loss, trainOp, verbose)
      File "/home/language_model/language_model.py", line 92, in _run_epoch
        loss, _= sess.run([loss, trainOp], feed_dict=feed)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
        run_metadata_ptr)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 952, in _run
        fetch_handler = _FetchHandler(self._graph, fetches, feed_dict_string)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 408, in __init__
        self._fetch_mapper = _FetchMapper.for_fetch(fetches)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 230, in for_fetch
        return _ListFetchMapper(fetch)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 337, in __init__
        self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 238, in for_fetch
        return _ElementFetchMapper(fetches, contraction_fn)
      File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 271, in __init__
        % (fetch, type(fetch), str(e)))
    TypeError: Fetch argument nan has invalid type <type 'numpy.float32'>, must be a string or Tensor. (Can not convert a float32 into a Tensor or Operation.)
The step that computes the loss seems to be the problem:
    def train(self,data, session=tf.Session(), verbose=10):
        print "initializing model"
        self._add_placeholders()
        inputs = self._add_embedding()
        rnn_ouputs, _ = self._run_rnn(inputs)
        outputs = self._projection_layer(rnn_ouputs)
        loss = self._compute_loss(outputs)
        trainOp = self._add_train_step(loss)
        start = tf.initialize_all_variables()
        saver = tf.train.Saver()
        with session as sess:
            sess.run(start)
            for epoch in xrange(self._max_epochs):
                train_pp = self._run_epoch(data, sess, inputs, rnn_ouputs, loss, trainOp, verbose)
                print "Training preplexity for batch {} - {}".format(epoch, train_pp)
This is the code for _run_epoch, where the loss comes back as nan:
    def _run_epoch(self, data, session, inputs, rnn_ouputs, loss, trainOp, verbose=10):
        with session.as_default() as sess:
            total_steps = sum(1 for x in data_iterator(data, self._batch_size, self._max_steps))
            train_loss = []
            for step, (x,y, l) in enumerate(data_iterator(data, self._batch_size, self._max_steps)):
                print "step - {0}".format(step)
                feed = {
                    self.input_placeholder: x,
                    self.label_placeholder: y,
                    self.sequence_length: l,
                    self._dropout_placeholder: self._dropout,
                }
                loss, _= sess.run([loss, trainOp], feed_dict=feed)
                print "loss - {0}".format(loss)
                train_loss.append(loss)
                if verbose and step % verbose == 0:
                    sys.stdout.write('\r{} / {} : pp = {}'. format(step, total_steps, np.exp(np.mean(train_loss))))
                    sys.stdout.flush()
            if verbose:
                sys.stdout.write('\r')
            return np.exp(np.mean(train_loss))
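This occurs when I test my code with data generated by random_data = np.random.normal(0, 100, size=[42068, 46]), which is meant to mimic passing in an object that uses word IDs as input. The rest of my code can be found in the following gist.
EDIT: This is how I run the test suite when this issue occurs: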
    def test_model():
        model = Language_model(vocab=range(0,101))
        s = tf.Session()
        #1 more than step size to accommodate for the <eos> token at the end
        random_data = np.random.normal(0, 100, size=[42068, 46])
        # file = "./data/ptb.test.txt"
        print "Fitting started"
        model.train(random_data, s)
    if __name__ == "__main__":
        test_model()
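If I feed random_data to other language models, they also output nan for the cost. My understanding is that TensorFlow, given the feed dict, should take the numeric values and retrieve the embedding vector corresponding to each ID. I don't understand why random_data causes nan in other models as well.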
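There are several problems with the code above.
Let's start with this line:
    random_data = np.random.normal(0, 100, size=[42068, 46])
np.random.normal(...) does not produce discrete values; it produces floating-point numbers. Let's try the same call with a more manageable size:
    >>> np.random.normal(0, 100, size=[5])
    array([-53.12407229, 39.57335574, -98.25406749, 90.81471139, -41.05069646])
A machine learning algorithm has no way to learn from these values: they are the inputs to an embedding model, which expects word IDs, and here they are floating-point numbers, some of them negative.
What is actually needed is the following:
    random_data = np.random.randint(0, 101, size=...)
Checking its output, we get:
    >>> np.random.randint(0, 100, size=[5])
    array([27, 47, 33, 12, 24])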
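To make the difference concrete, here is a minimal NumPy-only sketch. The embedding table, its width (EMBED_DIM), and the helpers is_valid_ids and lookup are assumptions made up for illustration and are not part of the original model; plain row indexing stands in for an embedding layer.

```python
import numpy as np

VOCAB_SIZE = 101   # matches vocab=range(0, 101) in the question
EMBED_DIM = 8      # hypothetical embedding width, chosen for illustration

rng = np.random.RandomState(0)
# Hypothetical embedding table: one row per word ID.
embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))


def is_valid_ids(batch, vocab_size=VOCAB_SIZE):
    """Return True if every entry is an integer ID in [0, vocab_size)."""
    batch = np.asarray(batch)
    if not np.issubdtype(batch.dtype, np.integer):
        return False
    return bool(batch.min() >= 0 and batch.max() < vocab_size)


def lookup(batch):
    """Plain row indexing, standing in for an embedding lookup."""
    if not is_valid_ids(batch):
        raise ValueError("word IDs must be integers in [0, %d)" % VOCAB_SIZE)
    return embeddings[np.asarray(batch)]


float_data = rng.normal(0, 100, size=[4, 46])        # what the question used
id_data = rng.randint(0, VOCAB_SIZE, size=[4, 46])   # integer IDs 0..100

print(is_valid_ids(float_data))   # floating-point data is rejected
print(is_valid_ids(id_data))      # integer IDs in range are accepted
print(lookup(id_data).shape)      # one embedding row per ID
```

A check like is_valid_ids, run on each batch before it goes into the feed dict, would flag the np.random.normal data immediately instead of letting it surface later as a nan loss.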
Next, the following code creates a subtle problem:
    def _run_epoch(self, data, session, inputs, rnn_ouputs, loss, train, verbose=10):
        with session.as_default() as sess:
            total_steps = sum(1 for x in data_iterator(data, self._batch_size, self._max_steps))
            train_loss = []
            for step, (x,y, l) in enumerate(data_iterator(data, self._batch_size, self._max_steps)):
                print "step - {0}".format(step)
                feed = {
                    self.input_placeholder: x,
                    self.label_placeholder: y,
                    self.sequence_length: l,
                    self._dropout_placeholder: self._dropout,
                }
                loss, _= sess.run([loss, train], feed_dict=feed)
                print "loss - {0}".format(loss)
                train_loss.append(loss)
                if verbose and step % verbose == 0:
                    sys.stdout.write('\r{} / {} : pp = {}'. format(step, total_steps, np.exp(np.mean(train_loss))))
                    sys.stdout.flush()
            if verbose:
                sys.stdout.write('\r')
            return np.exp(np.mean(train_loss))
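loss is both a function parameter and a local variable that gets reassigned inside the loop. After the first call to sess.run, loss holds the fetched numpy.float32 value rather than the tensor, so on the next step it can no longer be fetched in the session. That is exactly what the TypeError in the traceback reports.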
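One way to avoid the rebinding is to keep the tensor and the fetched value under different names. The sketch below reproduces the failure without TensorFlow: FakeTensor and FakeSession are hypothetical stand-ins, invented for illustration, whose run method rejects anything that is not a FakeTensor, mimicking the type check seen in the traceback.

```python
class FakeTensor(object):
    """Hypothetical stand-in for a graph tensor or op."""
    def __init__(self, value):
        self.value = value


class FakeSession(object):
    """Hypothetical stand-in for a session: only FakeTensor fetches are allowed."""
    def run(self, fetches):
        for fetch in fetches:
            if not isinstance(fetch, FakeTensor):
                raise TypeError("Fetch argument %r has invalid type %r"
                                % (fetch, type(fetch)))
        return [fetch.value for fetch in fetches]


def run_epoch_buggy(sess, loss, train_op, steps):
    losses = []
    for _ in range(steps):
        # Rebinds `loss` to a plain number; the next iteration fetches that number.
        loss, _unused = sess.run([loss, train_op])
        losses.append(loss)
    return losses


def run_epoch_fixed(sess, loss, train_op, steps):
    losses = []
    for _ in range(steps):
        # `loss` stays bound to the tensor; the fetched value gets its own name.
        loss_value, _unused = sess.run([loss, train_op])
        losses.append(loss_value)
    return losses


sess = FakeSession()
loss_tensor, train_op = FakeTensor(2.5), FakeTensor(None)

print(run_epoch_fixed(sess, loss_tensor, train_op, steps=3))
try:
    run_epoch_buggy(sess, loss_tensor, train_op, steps=3)
except TypeError as err:
    print("second step failed: %s" % err)
```

In the original _run_epoch, the equivalent change would be to write loss_value, _ = sess.run([loss, train], feed_dict=feed) and append loss_value to train_loss.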