How to use tf.bucket_by_sequence_length effectively in TensorFlow?

Question

I'm trying to use tf.bucket_by_sequence_length() from TensorFlow, but cannot quite figure out how to make it work.

Basically, it should take sequences (of different lengths) as input and produce buckets of sequences as output, but it does not seem to work this way.

From this discussion: https://github.com/tensorflow/tensorflow/issues/5609 I have the impression that it needs a queue in order to feed this function, sequence by sequence. It's not clear though.

The function's documentation can be found here: https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.training/bucketing#bucket_by_sequence_length

Answer 1

Indeed, the input tensor needs to come from a queue, which can be, for example, a tf.FIFOQueue().dequeue(), or a tf.TensorArray().read(tf.train.range_input_producer().dequeue()).

This notebook explains it quite well:

https://github.com/wcarvalho/jupyter_notebooks/blob/ebe762436e2eea1dff34bbd034898b64e4465fe4/tf.bucket_by_sequence_length/bucketing%20practice.ipynb
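
Conceptually, bucketing consumes one sequence at a time, puts it into the bucket that matches its length, and emits a padded batch whenever a bucket fills up. That is why the function is fed sequence by sequence. The following is only a pure-Python sketch of that idea, not TensorFlow code and not how the library implements it; the function name bucket_batches and its parameters are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_batches(sequences, boundaries, batch_size, pad_value=0):
    """Yield padded batches of similar-length sequences, reading one sequence at a time."""
    pending = defaultdict(list)  # bucket index -> sequences waiting for a full batch
    for seq in sequences:
        # Bucket i holds lengths in [boundaries[i - 1], boundaries[i]);
        # the last bucket holds everything at or above the largest boundary.
        bucket = bisect_right(boundaries, len(seq))
        pending[bucket].append(list(seq))
        if len(pending[bucket]) == batch_size:
            full = pending.pop(bucket)
            # Pad every sequence to the longest one in this batch.
            width = max(len(s) for s in full)
            yield [s + [pad_value] * (width - len(s)) for s in full]
    # Sequences left in partially filled buckets are dropped in this sketch.


toy_sequences = [[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11, 12]]
batches = list(bucket_batches(toy_sequences, boundaries=[4], batch_size=2))
for batch in batches:
    print(batch)
```

Note that the short sequences [1, 2] and [8] end up in one batch and the two longer ones in another, so little padding is wasted.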

Answer 2

My answer is based on TensorFlow 2.0. I can see that you might be using an older version of TensorFlow, but if you happen to use the new version, you can use the bucket_by_sequence_length API in the following manner.

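import tensorflow as tf


# This will be used by bucket_by_sequence_length to batch the elements according to their length.
# It returns the size of the first dimension of x, i.e. the sequence length.
def _element_length_fn(x, y=None):
    return tf.shape(x)[0]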


# These are the upper length boundaries for the buckets.
# Based on these boundaries, the sentences are assigned to different buckets.
# You can have as many boundaries as you want. With pad_to_bucket_boundary=True, make sure
# that the largest boundary is greater than the length of the longest sentence in your dataset.
boundaries = [upper_boundary_for_batch]  # Placeholder: define your own upper boundaries here.

# These define the batch sizes for the different buckets.
# I am keeping the batch size the same for each bucket, but this can be changed based on more analysis.
# As per the documentation: batch size per bucket. Length should be len(bucket_boundaries) + 1.
# https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length
batch_sizes = [batch_size] * (len(boundaries) + 1)

# bucket_by_sequence_length returns a dataset transformation function that has to be applied using dataset.apply.
# The important parameter here is pad_to_bucket_boundary. If it is set to True, the sentences are padded to the
# bucket boundary minus 1 (the maximum length in each bucket). If it is set to False, the sentences are padded
# to the maximum length found in the batch.
# The default padding value is 0, so we do not need to supply anything extra here.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        _element_length_fn,
        boundaries,
        batch_sizes,
        drop_remainder=True,
        pad_to_bucket_boundary=True))
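
To see the transformation end to end, here is a minimal sketch on toy data. The sequences, the boundaries [4, 8] and the batch size of 2 are made-up values for illustration, and the example assumes a TensorFlow 2.x version in which tf.data.Dataset.from_generator accepts output_signature. It leaves pad_to_bucket_boundary at its default of False, so each batch is padded only to the longest sequence it contains.

```python
import tensorflow as tf

# Toy data (assumed for illustration): variable-length integer sequences.
sequences = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17]]

# Build a dataset whose elements are 1-D tensors of unknown length.
dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)


def element_length_fn(x):
    # The sequence length is the size of the first dimension.
    return tf.shape(x)[0]


# Buckets: length < 4, 4 <= length < 8, length >= 8.
boundaries = [4, 8]
# One batch size per bucket, so len(boundaries) + 1 entries.
batch_sizes = [2, 2, 2]

bucketed = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=element_length_fn,
        bucket_boundaries=boundaries,
        bucket_batch_sizes=batch_sizes,
    )
)

# Collect the batches as plain Python lists for inspection.
batches = [batch.numpy().tolist() for batch in bucketed]
for batch in batches:
    print(batch)
```

Each printed batch contains sequences from a single bucket, padded with zeros to a common length. Newer TensorFlow releases also expose the same transformation as a method on the dataset itself, and mark the experimental function as deprecated.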
