
Keras, cascade multiple RNN models for N-dimensional output

I'm having some difficulty with chaining together two models in an unusual way.

I am trying to replicate the following flowchart:

[Flowchart: cascaded RNN, 2-D output]

For clarity, at each timestep of Model[0] I am attempting to generate an entire time series from IR[i] (Intermediate Representation) as a repeated input, using Model[1]. The purpose of this scheme is that it allows the generation of a ragged 2-D time series from a 1-D input (while both allowing the second model to be omitted when the output for that timestep is not needed, and not requiring Model[0] to constantly "switch modes" between accepting input and generating output).

I assume a custom training loop will be required, and I already have a custom training loop for handling statefulness in the first model (the previous version only had a single output at each timestep). As depicted, the second model should have reasonably short outputs (able to be constrained to fewer than 10 timesteps).

But at the end of the day, while I can wrap my head around what I want to do, I'm not nearly adroit enough with Keras and/or TensorFlow to actually implement it. (In fact, this is my first non-toy project with the library.)

I have unsuccessfully searched the literature for similar schemes to parrot, or example code to fiddle with. And I don't even know if this idea is possible from within TF/Keras.

I already have the two models working in isolation. (As in, I've worked out the dimensionality and done some training with dummy data to get garbage outputs for the second model, and the first model is based on a previous iteration of this problem and has been fully trained.) If I have Model[0] and Model[1] as python variables (let's call them model_a and model_b), then how would I chain them together to do this?

Edit to add:

If this is all unclear, perhaps having the dimensions of each input and output will help:

The dimensions of each input and output are:

Input: (batch_size, model_a_timesteps, input_size)
IR: (batch_size, model_a_timesteps, ir_size)

IR[i] (after duplication): (batch_size, model_b_timesteps, ir_size)
Out[i]: (batch_size, model_b_timesteps, output_size)
Out: (batch_size, model_a_timesteps, model_b_timesteps, output_size)
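
A minimal shape sketch of that layout, assuming small example sizes (batch_size=2, model_a_timesteps=3, model_b_timesteps=5, ir_size=8, output_size=4); the names and values are illustrative only, not a working implementation:

import numpy as np

batch_size, model_a_timesteps, model_b_timesteps = 2, 3, 5
ir_size, output_size = 8, 4

IR = np.random.randn(batch_size, model_a_timesteps, ir_size)    # Model[0]'s per-step outputs
IR_0 = np.repeat(IR[:, 0:1, :], model_b_timesteps, axis=1)      # IR[0] repeated as Model[1]'s input
# IR_0.shape == (batch_size, model_b_timesteps, ir_size)

# Out as a single 0-padded tensor; each Out[:, i] holds Model[1]'s series generated from IR[i]
Out = np.zeros((batch_size, model_a_timesteps, model_b_timesteps, output_size))
print(IR_0.shape, Out.shape)  # (2, 5, 8) (2, 3, 5, 4)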

As this question has multiple major parts, I've dedicated a Q&A to the core challenge: stateful backpropagation. This answer focuses on implementing the variable output step length.


Description:

  • As validated in Case 5, we can take a bottom-up approach: first we feed the complete input to model_a (A), then we feed its outputs as input to model_b (B), but this time one step at a time (see the sketch after this list).
  • Note that we must chain B's output steps per A's input step, not between A's input steps; i.e., in your diagram, gradient should flow between Out[0][1] and Out[0][0], but not between Out[2][0] and Out[0][1].
  • For computing loss it won't matter whether we use a ragged or padded tensor; we must, however, use a padded tensor for writing to TensorArray.
  • The loop logic in the code below is general; specific attribute handling and hidden-state passing, however, are hard-coded for simplicity, but can be rewritten for generality.
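
A conceptual eager-mode sketch of this ordering, assuming model_a maps (batch, A_steps, features) -> (batch, A_steps, units) and model_b is a per-step model mapping [x_t, state] -> (out_t, state) as built below; this is illustrative only, the actual TensorArray / while_loop implementation is at the bottom:

import tensorflow as tf

def cascade_sketch(model_a, model_b, x, steps_at_t, state0):
    ir = model_a(x, training=True)             # one full pass through A
    state = state0                             # B stays stateful across A's steps
    all_out = []
    for i in range(ir.shape[1]):               # iterate A's steps
        ir_i = ir[:, i]                        # (batch, units), repeated as B's input
        outs_i = []
        for _ in range(steps_at_t[i]):         # B runs several steps per A-step
            out, state = model_b([ir_i, state], training=True)
            outs_i.append(out)
        all_out.append(tf.stack(outs_i))       # (steps_at_t[i], batch, units); ragged across i
    return all_out, state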

Code: at bottom.


Example:

  • Here we predefine the number of iterations for B per input from A, but we can implement any arbitrary stopping logic. For example, we can take a Dense layer's output from B as a hidden state and check whether its L2-norm exceeds a threshold (a sketch follows this list).
  • Per above, if longest_step is unknown to us, we can simply set it, which is common for NLP & other tasks with a STOP token.
    • Alternatively, we may write to separate TensorArrays at every A's input with dynamic_size=True; see "point of uncertainty" below.
  • A valid concern is: how do we know gradients flow correctly? Note that we've validated them for both vertical and horizontal flow in the linked Q&A, but it didn't cover multiple output steps per input step, for multiple input steps. See below.
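
A hedged sketch of one such stopping rule (not part of the code below): keep stepping model_b until its state's L2-norm exceeds a threshold, with longest_step as a hard cap; model_b, ir_i, state, and threshold are illustrative names and values:

import tensorflow as tf

def run_b_until_stop(model_b, ir_i, state, longest_step=10, threshold=1e3):
    outs = []
    for _ in range(longest_step):              # hard cap keeps the loop bounded
        out, state = model_b([ir_i, state], training=True)
        outs.append(out)
        if tf.norm(state) > threshold:         # arbitrary stopping condition on the state's L2-norm
            break
    return tf.stack(outs), state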

Point of uncertainty: I'm not entirely sure whether gradients interact between e.g. Out[0][1] and Out[2][0]. I did, however, verify that gradients will not flow horizontally if we write to separate TensorArrays for B's outputs per A's inputs (case 2); reimplementing for cases 4 & 5, grads will differ for both models, including the lower one with a complete single horizontal pass.

Thus we must write to a unified TensorArray. For that, as there are no ops leading from e.g. IR[1] to Out[0][1], I can't see how TF would trace it as such, so it seems we're safe. Note, however, that in the example below, using steps_at_t=[1]*6 will make gradient flow horizontally in both models, as we're writing to a single TensorArray and passing hidden states.

The examined case is confounded, however, by B being stateful at all steps; lifting this requirement, we might not need to write to a unified TensorArray for all Out[0], Out[1], etc., but we must still test against something we know works, which is no longer as straightforward.
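
One hedged way to probe this (a diagnostic, not a proof), using the MultiStatefulNetwork code given below: take the gradient of a single output slice with respect to the input and inspect which input timesteps receive nonzero gradient:

import numpy as np
import tensorflow as tf

x0 = tf.constant(np.random.randn(2, 3, 4), dtype='float32')
msn = MultiStatefulNetwork(batch_shape=(2, 3, 4), steps_at_t=[3, 4, 2])

with tf.GradientTape() as tape:
    tape.watch(x0)
    outputs = msn(x0)                          # (outer_steps, longest_step, batch, units)
    target = tf.reduce_sum(outputs[0, 1])      # roughly "Out[0][1]"

print(tape.gradient(target, x0))               # nonzero entries show which input steps feed this slice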


Example [code]:

import numpy as np
import tensorflow as tf

#%%# Make data & models, then fit ###########################################
x0 = y0 = tf.constant(np.random.randn(2, 3, 4), dtype='float32')  # float32 to match the models' inputs
msn = MultiStatefulNetwork(batch_shape=(2, 3, 4), steps_at_t=[3, 4, 2])

#%%#############################################
with tf.GradientTape(persistent=True) as tape:
    outputs = msn(x0)
    # shape: (3, 4, 2, 4), 0-padded
    # We can pad labels accordingly.
    # Note the (2, 4) model_b's output shape, which is a timestep slice;
    # model_b is a *slice model*. Careful in implementing various logics
    # which are and aren't intended to be stateful.
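    # --- hedged continuation (not from the original answer): stand-in labels shaped
    # --- like the 0-padded output, and an MSE loss computed inside the same tape
    y_padded = tf.zeros_like(outputs)          # illustrative placeholder labels
    loss = tf.reduce_mean(tf.square(outputs - y_padded))

# pull gradients for both models from the persistent tape; SGD here is an illustrative choice
grads_a = tape.gradient(loss, msn.model_a.trainable_weights)
grads_b = tape.gradient(loss, msn.model_b.trainable_weights)

opt = tf.keras.optimizers.SGD(1e-2)
opt.apply_gradients(zip(grads_a, msn.model_a.trainable_weights))
opt.apply_gradients(zip(grads_b, msn.model_b.trainable_weights))
del tape  # release the persistent tape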

Methods:

Not the cleanest nor the most optimal code, but it works; there is room for improvement.

More importantly: I implemented this in Eager and have no idea how it'll behave in Graph; making it work for both can be quite tricky. If needed, run it in Graph and compare all values as done in the "cases".

# ideally we won't `import tensorflow` at all; kept for code simplicity
import tensorflow as tf
from tensorflow.python.util import nest
from tensorflow.python.ops import array_ops, tensor_array_ops
from tensorflow.python.framework import ops

from tensorflow.keras.layers import Input, SimpleRNN, SimpleRNNCell
from tensorflow.keras.models import Model

#######################################################################
class MultiStatefulNetwork():
    def __init__(self, batch_shape=(2, 6, 4), steps_at_t=[]):
        self.batch_shape=batch_shape
        self.steps_at_t=steps_at_t

        self.batch_size = batch_shape[0]
        self.units = batch_shape[-1]
        self._build_models()

    def __call__(self, inputs):
        outputs = self._forward_pass_a(inputs)
        outputs = self._forward_pass_b(outputs)
        return outputs

    def _forward_pass_a(self, inputs):
        return self.model_a(inputs, training=True)

    def _forward_pass_b(self, inputs):
        return model_rnn_outer(self.model_b, inputs, self.steps_at_t)

    def _build_models(self):
        ipt = Input(batch_shape=self.batch_shape)
        out = SimpleRNN(self.units, return_sequences=True)(ipt)
        self.model_a = Model(ipt, out)

        ipt  = Input(batch_shape=(self.batch_size, self.units))
        sipt = Input(batch_shape=(self.batch_size, self.units))
        out, state = SimpleRNNCell(self.units)(ipt, sipt)  # cell units must match the state input size
        self.model_b = Model([ipt, sipt], [out, state])

        self.model_a.compile('sgd', 'mse')
        self.model_b.compile('sgd', 'mse')


def inner_pass(model, inputs, states):
    return model_rnn(model, inputs, states)


def model_rnn_outer(model, inputs, steps_at_t=[2, 2, 4, 3]):
    def outer_step_function(inputs, states):
        x, steps = inputs
        x = array_ops.expand_dims(x, 0)
        x = array_ops.tile(x, [steps, *[1] * (x.ndim - 1)])  # repeat steps times
        output, new_states = inner_pass(model, x, states)
        return output, new_states

    (outer_steps, steps_at_t, longest_step, outer_t, initial_states,
     output_ta, input_ta) = _process_args_outer(model, inputs, steps_at_t)

    def _outer_step(outer_t, output_ta_t, *states):
        current_input = [input_ta.read(outer_t), steps_at_t.read(outer_t)]
        output, new_states = outer_step_function(current_input, tuple(states))

        # pad if shorter than longest_step.
        # model_b may output twice, but longest in `steps_at_t` is 4; then we need
        # output.shape == (2, *model_b.output_shape) -> (4, *...)
        # checking directly on `output` is more reliable than from `steps_at_t`
        output = tf.cond(
            tf.math.less(output.shape[0], longest_step),
            lambda: tf.pad(output, [[0, longest_step - output.shape[0]],
                                    *[[0, 0]] * (output.ndim - 1)]),
            lambda: output)

        output_ta_t = output_ta_t.write(outer_t, output)
        return (outer_t + 1, output_ta_t) + tuple(new_states)

    final_outputs = tf.while_loop(
        body=_outer_step,
        loop_vars=(outer_t, output_ta) + initial_states,
        cond=lambda outer_t, *_: tf.math.less(outer_t, outer_steps))

    output_ta = final_outputs[1]
    outputs = output_ta.stack()
    return outputs


def _process_args_outer(model, inputs, steps_at_t):
    def swap_batch_timestep(input_t):
        # Swap the batch and timestep dim for the incoming tensor.
        # (samples, timesteps, channels) -> (timesteps, samples, channels)
        # iterating dim0 to feed (samples, channels) slices expected by RNN
        axes = list(range(len(input_t.shape)))
        axes[0], axes[1] = 1, 0
        return array_ops.transpose(input_t, axes)

    inputs = nest.map_structure(swap_batch_timestep, inputs)

    assert inputs.shape[0] == len(steps_at_t)
    outer_steps = array_ops.shape(inputs)[0]  # model_a_steps
    longest_step = max(steps_at_t)
    steps_at_t = tensor_array_ops.TensorArray(
        dtype=tf.int32, size=len(steps_at_t)).unstack(steps_at_t)

    # assume single-input network, excluding states which are handled separately
    input_ta = tensor_array_ops.TensorArray(
        dtype=inputs.dtype,
        size=outer_steps,
        element_shape=tf.TensorShape(model.input_shape[0]),
        tensor_array_name='outer_input_ta_0').unstack(inputs)

    # TensorArray is used to write outputs at every timestep, but does not
    # support RaggedTensor; thus we must make TensorArray such that column length
    # is that of the longest outer step, and pad model_b's outputs accordingly
    element_shape = tf.TensorShape((longest_step, *model.output_shape[0]))

    # overall shape: (outer_steps, longest_step, *model_b.output_shape)
    # for every input / at each step we write in dim0 (outer_steps)
    output_ta = tensor_array_ops.TensorArray(
        dtype=model.output[0].dtype,
        size=outer_steps,
        element_shape=element_shape,
        tensor_array_name='outer_output_ta_0')

    outer_t = tf.constant(0, dtype='int32')
    initial_states = (tf.zeros(model.input_shape[0], dtype='float32'),)

    return (outer_steps, steps_at_t, longest_step, outer_t, initial_states,
            output_ta, input_ta)


def model_rnn(model, inputs, states):
    def step_function(inputs, states):
        output, new_states = model([inputs, *states], training=True)
        return output, new_states

    initial_states = states
    input_ta, output_ta, time, time_steps_t = _process_args(model, inputs)

    def _step(time, output_ta_t, *states):
        current_input = input_ta.read(time)
        output, new_states = step_function(current_input, tuple(states))

        flat_state = nest.flatten(states)
        flat_new_state = nest.flatten(new_states)
        for state, new_state in zip(flat_state, flat_new_state):
            if isinstance(new_state, ops.Tensor):
                new_state.set_shape(state.shape)

        output_ta_t = output_ta_t.write(time, output)
        new_states = nest.pack_sequence_as(initial_states, flat_new_state)
        return (time + 1, output_ta_t) + tuple(new_states)

    final_outputs = tf.while_loop(
        body=_step,
        loop_vars=(time, output_ta) + tuple(initial_states),
        cond=lambda time, *_: tf.math.less(time, time_steps_t))

    new_states = final_outputs[2:]
    output_ta = final_outputs[1]
    outputs = output_ta.stack()
    return outputs, new_states


def _process_args(model, inputs):
    time_steps_t = tf.constant(inputs.shape[0], dtype='int32')

    # assume single-input network (excluding states)
    input_ta = tensor_array_ops.TensorArray(
        dtype=inputs.dtype,
        size=time_steps_t,
        tensor_array_name='input_ta_0').unstack(inputs)

    # assume single-output network (excluding states)
    output_ta = tensor_array_ops.TensorArray(
        dtype=model.output[0].dtype,
        size=time_steps_t,
        element_shape=tf.TensorShape(model.output_shape[0]),
        tensor_array_name='output_ta_0')

    time = tf.constant(0, dtype='int32', name='time')
    return input_ta, output_ta, time, time_steps_t
