Teacher-student system: training the student with a top-k hypotheses list

Question

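I want to configure a teacher-student system in which a teacher seq2seq model generates a top-k list of hypotheses, which is then used to train a student seq2seq model.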

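My plan is to batch the teacher's hypotheses, meaning that the teacher outputs a tensor with a batch axis of length k * B, where B is the length of the input batch axis. The output batch tensor then contains k hypotheses for each sequence in the input batch tensor, sorted by the position of the associated input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensor still has a batch axis of length B, so I use tf.repeat to repeat the sequences in the output tensor of the student's encoder k times before feeding that tensor into the student's decoder.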

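For debugging purposes, I have simplified this for now to repeat only the single best hypothesis of the teacher, before I implement the top-k list selection.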

Here is an excerpt of my config file:

[...]

# Variables:

student_target = "teacher_hypotheses_stack"

[...]

# Custom repeat function:

def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf

    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])

    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])

    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")


def repeat_s(source, **kwargs):
    return repeat(source, "student")


[...]

# Configuration of the teacher + repeating of its output

**teacher_network(),  # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable
"teacher_stack": {
    "class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
    "trainable": False
    # "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": {  # This is an attempt to explicitly (re-)select the batch axis. It is probably unnecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["teacher_stack"],
    "trainable": False,
    "register_as_extern_data": "teacher_hypotheses_stack"
}

[...]

# Repeating of the student's encoder output + configuration of its decoder

"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]},  # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": {  # This is an attempt to explicitly (re-)select the batch axis. It is probably unnecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["student_encoder_repeater"]
},

"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim},  # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]},  # (B, enc-T, H, D'/H)

"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
    'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
    "end": {"class": "compare", "from": ["output"], "value": 0},
    'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0},  # feedback_input
    "model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
    "model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
    "model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads},  # (B, enc-T, H)
    "model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]},  # (B, enc-T, H)
    "model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
                                 "eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
    "model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"},  # (B, H, V)
    "model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]},  # (B, H*V)
    "model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3},  # transform
    "model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3},  # merge + post_merge bias
    "model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
    "model1_output_prob": {
        "class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
        "target": student_target,
        "loss": "ce", "loss_opts": {"label_smoothing": 0.1}
    }
}, "target": student_target},

[...]

Running this config prints the following error message to the console:

[...]

Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
     [[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

[...]

Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
 'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
  <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
  <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
  <tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION

[...]
File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
    x = check_dim_equal(x, 0, seq_lens, 0)
[...]

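So the network is constructed without errors, but the first training step crashes with an assertion error. To me it looks as if RETURNN or TensorFlow somehow validates the batch length against its original value. But I don't know where or why, so I have no idea what to do about it.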

What am I doing wrong? Can my idea even be implemented with RETURNN this way?

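Edit (June 10, 2020): For clarification: my ultimate goal is to let the teacher generate a top-k list of hypotheses for each input sequence, which is then used to train the student. So for each input sequence of the student there are k solutions/target sequences. To train the student, it must predict the probability of each hypothesis, and the cross-entropy loss is then calculated to determine the update gradients. But if there are k target sequences for each input sequence, the student must decode the encoder states k times, each time against a different target sequence. This is why I want to repeat the encoder states k times: to make the student decoder's data parallel to the targets and then use RETURNN's default cross-entropy loss implementation: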

input-seq-1 --- teacher-hyp-1-1; 
input-seq-1 --- teacher-hyp-1-2; 
...; 
input-seq-1 --- teacher-hyp-1-k; 
input-seq-2 --- teacher-hyp-2-1; 
...
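
The pairing above can be sketched in plain Python. This is only an illustration of the index layout, assuming the k hypotheses of each input are stored contiguously along the batch axis; `pair_index` and `build_pairs` are hypothetical helper names, not RETURNN API.

```python
def pair_index(n, k):
    """Map row n of the flattened (B * k) batch to (input index, hypothesis index)."""
    return divmod(n, k)


def build_pairs(num_inputs, k):
    """List (input tag, hypothesis tag) pairs in flattened batch order."""
    pairs = []
    for n in range(num_inputs * k):
        i, j = pair_index(n, k)
        pairs.append(("input-seq-%d" % (i + 1), "teacher-hyp-%d-%d" % (i + 1, j + 1)))
    return pairs


pairs = build_pairs(num_inputs=2, k=3)
for src, hyp in pairs:
    print(src, "---", hyp)
```

With this layout, row n of the repeated encoder output belongs to input n // k and is scored against hypothesis n % k of that input.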

Is there a more appropriate way to achieve my goal?

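Edit (June 12, 2020 #1): Yes, I know that the teacher's DecisionLayer already selects the best hypothesis, and that this way I am only repeating that best hypothesis k times. I am doing this as an intermediate step towards my ultimate goal. Later I want to fetch the top-k list from the teacher's ChoiceLayer somehow, but that felt like a separate construction site.
But Albert, you say that RETURNN would somehow extend the data along the batch dimension automatically? How should I picture that?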

Edit (June 12, 2020 #2): Okay, now I select the top-k (this time k=4) hypotheses list from the teacher's choice layer (or output layer) with:

"teacher_hypotheses": {
    "class": "copy", "from": ["extra.search:teacherMT_output"],
    "register_as_extern_data": "teacher_hypotheses_stack"
}

But using this data as the student's training target leads to the error:

TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
     [[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

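I assume this is because the student's target data, the hypotheses list, has a batch axis k=4 times longer than that of the student's input data/encoder state data. Doesn't the student's encoder state data need to be extended/repeated here to match the target data?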

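Edit (June 12, 2020 #3): I consider the initial issue solved. The overall problem is continued in the question "Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence".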

Answer 1

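It does not only validate the batch length. It collapses batch and time (it uses flatten_with_seq_len_mask, see the code of Loss.init and of that function) and then calculates the loss on that flattened tensor. So the seq lengths need to match as well. This might be the problem, but I'm not sure. As you use the same target for the rec layer itself, it should have the same seq lengths in training.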
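
A rough NumPy sketch of such a flattening step may help to see why the assertion fires. It is not RETURNN's implementation (the real function lives in TFUtil and works on TensorFlow tensors); the function below is a simplified stand-in that assumes a batch-major input of shape (B, T).

```python
import numpy as np


def flatten_with_seq_len_mask(x, seq_lens):
    """Simplified stand-in: drop the padded frames of a batch-major (B, T) array."""
    if x.shape[0] != seq_lens.shape[0]:
        raise ValueError("x.shape[0] != seq_lens.shape[0]: %r vs %r" % (x.shape, seq_lens.shape))
    time = np.arange(x.shape[1])
    mask = time[None, :] < seq_lens[:, None]  # (B, T), True for real frames
    return x[mask]  # shape (sum(seq_lens),)


# Shapes taken from the error message: labels (69, 17) after the repeat, seq lens still (23,).
labels = np.zeros((69, 17), dtype="int32")
seq_lens = np.full((23,), 17, dtype="int32")
try:
    flatten_with_seq_len_mask(labels, seq_lens)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Once the seq lens are repeated as well, the shapes agree.
flat = flatten_with_seq_len_mask(labels, np.repeat(seq_lens, 3))
```

The repeated labels have 69 batch entries while the sequence lengths still describe 23, which is the `[69 17]` versus `[23]` pair in the assertion message.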

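You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. check the Data (batch-shape-meta) output and whether all the axes are as you expect them to be. (debug_print_layer_output_template can and should always be enabled. It does not make anything slower.) You can also temporarily enable debug_print_layer_output_shape, which prints the actual shape of every tensor. That way you can verify what it looks like.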
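
In the config file, the two options named above are plain top-level settings. A minimal fragment, to be merged into the existing config:

```python
# Keep this on permanently: logs the Data template of every layer at construction time.
debug_print_layer_output_template = True
# Enable temporarily: additionally prints the actual runtime shape of every layer output.
debug_print_layer_output_shape = True
```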

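Your usage of ReinterpretDataLayer looks very wrong. You should never explicitly set axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing this at all? This could be the reason why things are messed up in the end.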

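Your repeat function is not very generic. You are using hard-coded axis integers there as well. You should never do that. Instead, you would write something like: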

input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)

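Did I understand correctly that this is what you want to do? Repeat along the batch axis? In that case, you also need to adapt the seq length information of that layer's output. You cannot simply use that function as-is in an EvalLayer. You also need to define out_type as a function that returns the correct Data template. E.g. like this: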

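def repeat_out(out):
    import tensorflow as tf

    # Copy the Data template and repeat its seq lengths along the batch axis,
    # with the same factor (3) as in the repeat function.
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out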

...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}
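
The relationship between the repeated tensor and its sequence lengths can be checked outside of RETURNN. The following is a NumPy sketch, not RETURNN code: `repeat_batch` is a hypothetical helper, `np.repeat` stands in for `tf.repeat`, and the batch axis is passed in explicitly in the way `input_data.batch_dim_axis` would provide it.

```python
import numpy as np


def repeat_batch(x, seq_lens, batch_axis, k):
    """Repeat every sequence k times along the batch axis, together with its length."""
    x_rep = np.repeat(x, k, axis=batch_axis)
    seq_lens_rep = np.repeat(seq_lens, k, axis=0)
    assert x_rep.shape[batch_axis] == seq_lens_rep.shape[0]
    return x_rep, seq_lens_rep


# Time-major layout as in the log output of the question: (T, B) = (17, 23).
teacher = np.arange(17 * 23).reshape(17, 23)
teacher_lens = np.full((23,), 17)
rep, rep_lens = repeat_batch(teacher, teacher_lens, batch_axis=1, k=3)
print(rep.shape, rep_lens.shape)  # (17, 69) (69,)
```

Because the axis is looked up rather than hard-coded, the same helper works for batch-major input by passing `batch_axis=0`.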

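Now you have the further problem that every time you call this repeat_out, you get another seq length info. RETURNN will not be able to tell (at compile time) whether these seq lengths are all the same or different. That will cause errors or strange effects. To solve this, you should reuse the same seq length. E.g. like this: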

"teacher_stack_": {
    "class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
    "class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}

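By the way, why do you want to do this repetition? What is the idea behind it? You repeat both the student and the teacher 3 times? Wouldn't just increasing your learning rate by a factor of 3 have the same effect?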

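Edit: It seems this is done to match the top-k list. In that case, this is all wrong, as RETURNN should already do such repetition automatically. You should not do this manually.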

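Edit: To understand how the repetition (and beam search resolving in general) works, you should first look at the log output (you must have debug_print_layer_output_template enabled, but you should have that on all the time anyway). You will see the output of each layer, especially its Data output object. This is already useful for checking whether the shapes are all as you expect (check batch_shape_meta in the log). However, this is only the static shape at compile time, so the batch dim is just a marker there. You will also see the search beam information. It keeps track of whether the batch originates from some beam search (basically any ChoiceLayer) and has a beam, and of the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).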

Edit: To repeat, this is done automatically. You do not need to call copy_extend_with_beam manually.

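If you expect to get the top-k list from the teacher, you are also likely doing it wrong, as I see that you used "teacher_decision" as input. I guess this comes from a DecisionLayer? In that case, it has already taken only the first-best out of the top-k beam.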
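
What a decision step conceptually does to a beam can be sketched as follows. This is an illustration in NumPy, under the assumption that the beam entries of each sequence are contiguous along the batch axis; `take_first_best` is a hypothetical helper, not the DecisionLayer implementation.

```python
import numpy as np


def take_first_best(hyps, scores, beam_size):
    """Pick the best-scoring hypothesis per sequence from a (B * beam, T) array."""
    batch = hyps.shape[0] // beam_size
    scores = scores.reshape(batch, beam_size)
    best = scores.argmax(axis=1)  # (B,), beam index of the best hypothesis
    rows = np.arange(batch) * beam_size + best
    return hyps[rows]  # (B, T): the other beam_size - 1 hypotheses are dropped


hyps = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # B=2, beam=2, T=2
scores = np.array([-0.1, -2.0, -3.0, -0.5])
best = take_first_best(hyps, scores, beam_size=2)
```

Repeating such an output k times therefore yields k copies of the same hypothesis per sequence, not a top-k list.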

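Edit: Now I understand that you are ignoring this and instead want to take only the first-best and then repeat that as well. I would recommend not doing that, as you are making it unnecessarily complicated, and you are more or less fighting RETURNN, which knows what the batch dim should be and will get confused. (You can make it work with what I wrote above, but really, this is just unnecessarily complicated.)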

By the way, there is no point in setting "trainable": False on an EvalLayer. It has no effect, since the eval layer has no parameters anyway.

Answer 1 (score 0, accepted, 2020-06-10 10:53:50)
