Vectorized beam search decoder is not faster on GPU - TensorFlow 2

Question

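I am trying to run an RNN beam search on a tf.keras.Model in a vectorized way so that it works entirely on the GPU. However, even though everything is a tf.function and as vectorized as I can make it, it runs at exactly the same speed with or without a GPU. Attached is a minimal example with a fake model. In reality, for n=32, k=32, steps=128, which is what I want to work with, decoding takes 20 seconds (per n=32 samples), both on CPU and on GPU!

I must be missing something. When I train the model, a training iteration (128 steps) with batch size 512 takes 100 ms on GPU, and a training iteration with batch size 32 takes 1 second on CPU. The GPU is not saturated at batch size 512. I understand that I have overhead from running the steps individually and from doing a blocking operation per step, but in terms of computation my overhead is negligible compared to the rest of the model.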

I also get that using a tf.keras.Model in this way is probably not ideal, but is there another way to wire output tensors via a function back to the input tensors, and particularly also rewire the states?

Full working example: https://gist.github.com/meowcat/e3eaa4b8543a7c8444f4a74a9074b9ae

import tensorflow as tf
tm = tf.math  # assumed alias: the snippet calls tm.log and tm.top_k

@tf.function
def decode_beam(states_init, scores_init, y_init, steps, k, n):
    states = states_init
    scores = scores_init
    xstep = embed_y_to_x(y_init)

    # Keep the results in TensorArrays
    y_chain = tf.TensorArray(dtype="int32", size=steps)
    sequences_chain = tf.TensorArray(dtype="int32", size=steps)
    scores_chain = tf.TensorArray(dtype="float32", size=steps)


    for i in range(steps):
        # model_decode is the trained model with 3.5 million trainable params.
        # Run a single step of the RNN model.
        y, states = model_decode([xstep, states])
        # Add the log-probabilities of this step to the previous cumulative scores
        # (I left out the sequence end killer for this demo)
        scores_y = tf.expand_dims(tf.reshape(scores, y.shape[:-1]), 2) + tm.log(y)
        # Flatten to (n, k*tokens) and find the best k continuations for each of the n candidates
        scores_y = tf.reshape(scores_y, [n, -1])
        top_k = tm.top_k(scores_y, k, sorted=False)
        # Transform the indices. I was using tf.unravel_index but
        # `tf.debugging.set_log_device_placement(True)` indicated that this would be placed on the CPU
        # thus I rewrote it
        top_k_index = tf.reshape(
                top_k[1] + tf.reshape(tf.range(n), (-1, 1)) * scores_y.shape[1], [-1])
        ysequence = top_k_index // y.shape[2]
        ymax = top_k_index % y.shape[2]
        # this gives us two (n*k,) tensors with parent sequence (ysequence) 
        # and chosen character (ymax) per sequence.
        # For continuation, pick the states, and "return" the scores
        states = tf.gather(states, ysequence)
        scores = tf.reshape(top_k[0], [-1])
        # Write the results into the TensorArrays,
        # and embed for the next step
        xstep = embed_y_to_x(ymax)
        y_chain = y_chain.write(i, ymax)
        sequences_chain = sequences_chain.write(i, ysequence)
        scores_chain = scores_chain.write(i, scores)
    # Done: Stack up the results and return them
    sequences_final = sequences_chain.stack()
    y_final = y_chain.stack()
    scores_final = scores_chain.stack()

    return sequences_final, y_final, scores_final
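
To make the index arithmetic in the loop concrete, here is a minimal plain-Python sketch of one beam step. It is not the author's code. The helper name `beam_step_indices` and the list-based layout are assumptions for illustration. It mirrors the transformation above: take the top k entries of each flattened (k*tokens) row, convert them to a global flat index, then split that index into a parent beam (`ysequence`) and a token (`ymax`).

```python
import heapq

def beam_step_indices(scores_y, k, vocab):
    """scores_y: n rows, each holding k*vocab cumulative log-scores.

    Returns flat lists of length n*k: parent beam index, chosen token, score.
    (Hypothetical helper; names are illustrative only.)
    """
    parents, tokens, best = [], [], []
    for row, row_scores in enumerate(scores_y):
        width = len(row_scores)  # k * vocab entries per candidate
        # Indices of the k highest scores within this row
        top = heapq.nlargest(k, range(width), key=row_scores.__getitem__)
        for local in top:
            flat = local + row * width    # index into the flattened (n*k*vocab) array
            parents.append(flat // vocab) # which of the n*k previous beams to continue
            tokens.append(flat % vocab)   # which token extends that beam
            best.append(row_scores[local])
    return parents, tokens, best

# Two candidates (n=2), two beams each (k=2), three tokens (vocab=3)
scores = [
    [-1.0, -5.0, -3.0, -2.0, -0.5, -9.0],
    [-7.0, -6.0, -8.0, -1.0, -2.0, -3.0],
]
parents, tokens, best = beam_step_indices(scores, k=2, vocab=3)
```

The parent indices are what `tf.gather(states, ysequence)` consumes in the TensorFlow version, so each surviving beam continues from the state of the beam it extends.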

Answer 1

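There was a lot going on here. I will comment on it since it may help others resolve TensorFlow performance issues.

Profiling

The GPU profiler library (cupti) was not loading correctly on the cluster, which kept me from doing any useful profiling on the GPU. That has been fixed, so I now get useful GPU profiles.

Note this very useful answer (the only one on the web) that shows how to profile arbitrary TensorFlow 2 code, rather than Keras training: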

https://stackoverflow.com/a/56698035/1259675

logdir = "log"
writer = tf.summary.create_file_writer(logdir)
tf.summary.trace_on(graph=True, profiler=True)

# run any @tf.function decorated functions here

sequences, y, scores = decode_beam_steps(
    y_init, states_init, scores_init, 
    steps = steps, k = k, n = n, pad_mask = pad_mask)  

with writer.as_default():
    tf.summary.trace_export(name="model_trace", step=0, profiler_outdir=logdir)
tf.summary.trace_off()

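Note that an old Chromium version is needed to view the profiling results, since at the time (4-17-20) this failed in current Chrome/Chromium.

Small optimizations

Using unroll=True in the LSTM cells used by the model (not shown here) made the graph a bit lighter but not noticeably faster. Only one step is needed, so the symbolic loop only adds clutter. This significantly cuts the time for the first iteration of the function above, when AutoGraph builds the graph. Note that this time is enormous (see below).
unroll=False (the default) builds in 300 seconds, unroll=True builds in 100 seconds. Note that the performance itself stays the same (15-20 seconds/iteration for n=32, k=32).

implementation=1 was slightly slower, so I stayed with the default of implementation=2.

Using `tf.while_loop` instead of relying on AutoGraph

The for i in range(steps) loop. I had this both in the inlined version (shown above) and in a modularized one: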

    for i in range(steps):
        ystep, states = model_decode([xstep, states])
        ymax, ysequence, states, scores = model_beam_step(
            ystep, states, scores, k, n, pad_mask)
        xstep = model_rtox(ymax)
        y_chain = y_chain.write(i, ymax)
        sequences_chain = sequences_chain.write(i, ysequence)
        scores_chain = scores_chain.write(i, scores)

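where model_beam_step does all the beam search math. Unsurprisingly, both performed exactly equally badly, and in particular both took about 100/300 seconds on the first run, when AutoGraph traced the graph. Further, tracing the graph with the profiler produces a crazy 30-50 MB file that does not load easily in TensorBoard and more or less crashes it. The profile had dozens of parallel GPU streams with a single operation each.

Replacing this with a tf.while_loop cut the setup time to zero (back_prop=False makes very little difference) and produces a nice 500 KB graph that can easily be viewed in TensorBoard and profiled in a useful way, with 4 GPU streams.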


    beam_steps_cond = lambda i, y_, seq_, sc_, xstep, states, scores: i < steps
    def decode_beam_steps_body(i, y_, seq_, sc_, xstep, states, scores):
        y, states = model_decode([xstep, states])
        ymax, ysequence, states, scores = model_beam_step(
                y, states, scores, k, n, pad_mask)
        xstep = model_rtox(ymax)
        y_ = y_.write(i, ymax)
        seq_ = seq_.write(i, ysequence)
        sc_ = sc_.write(i, scores)
        i = i + 1
        return i, y_, seq_, sc_, xstep, states, scores
    
    i = tf.constant(0)  # loop counter; its initialization is not shown in the original snippet
    _, y_chain, sequences_chain, scores_chain, _, _, _ = \
        tf.while_loop(
            cond = beam_steps_cond,
            body = decode_beam_steps_body,
            loop_vars = [i, y_chain, sequences_chain, scores_chain,
                         xstep, states, scores],
            back_prop = False
            )
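
A likely explanation for the difference in graph size, offered as a sketch rather than a confirmed diagnosis of the code above: inside a tf.function, AutoGraph unrolls a `for` loop over a Python `range` at trace time, so the graph holds one copy of the body per step. A loop over `tf.range` (or an explicit `tf.while_loop`) becomes a single loop op. The toy functions `unrolled`, `looped` and `graph_ops` below are hypothetical names used only to show the effect on the traced graph.

```python
import tensorflow as tf

@tf.function
def unrolled(x, steps):
    # Python range: the body is traced `steps` times into the graph
    for _ in range(steps):
        x = x + 1.0
    return x

@tf.function
def looped(x, steps):
    # Tensor range: AutoGraph converts this into one while-loop op
    for _ in tf.range(steps):
        x = x + 1.0
    return x

def graph_ops(fn, *args):
    """Return the operations of the graph traced for these arguments."""
    return fn.get_concrete_function(*args).graph.get_operations()

x0 = tf.constant(0.0)
ops_unrolled = graph_ops(unrolled, x0, 128)
ops_looped = graph_ops(looped, x0, 128)
print(len(ops_unrolled), len(ops_looped))
```

Both functions return the same value. Only the traced graph differs, which matches the observation that the tf.while_loop version traces quickly and yields a much smaller graph.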

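Finally, the real problem

Once I could actually look at the profile in a meaningful way, it turned out that the real problem was an output postprocessing function that runs on the CPU. I did not suspect it because it had been running fast earlier, but I had overlooked that a beam search modification I made leads to >>>k sequences per candidate, which slows the processing down massively. It was therefore cancelling any benefit I could gain from running the decoding step efficiently on the GPU. Without this postprocessing, the GPU runs >2 iterations/sec. Refactoring the postprocessing (which is extremely fast if done right) into TensorFlow resolved the issue.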
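
The answer does not show the postprocessing itself. As an illustration only, one step that decoder outputs of this shape typically need is a backtrace: following the parent indices (`sequences_final`) backwards to rebuild each full token sequence from the per-step tokens (`y_final`). The plain-Python sketch below assumes that layout. The function name `backtrace` is hypothetical, and whether this matches the author's actual postprocessing is an assumption.

```python
def backtrace(parents, tokens):
    """Rebuild full token sequences from per-step beam search outputs.

    parents[t][j]: index, among the beams of step t-1, of the beam that
                   beam j at step t extends.
    tokens[t][j]:  token chosen for beam j at step t.
    Returns one token list per final beam.
    (Hypothetical helper; assumes the layout described in the lead-in.)
    """
    steps = len(tokens)
    beams = len(tokens[-1])
    sequences = []
    for j in range(beams):
        seq = []
        idx = j
        # Walk from the last step back to the first, following parent pointers
        for t in range(steps - 1, -1, -1):
            seq.append(tokens[t][idx])
            idx = parents[t][idx]
        seq.reverse()
        sequences.append(seq)
    return sequences

# Three steps, two beams
parents = [[0, 1], [1, 0], [0, 0]]
tokens = [[5, 6], [7, 8], [9, 3]]
result = backtrace(parents, tokens)
```

The work grows with steps times the number of final beams, so keeping many more than k sequences per candidate multiplies the cost of a Python-level loop like this one. In TensorFlow the inner lookup can be expressed as one gather per step over all beams at once.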

Solution 1
0 2020-04-16 23:55:29
