
How to provide multiple targets to a Seq2Seq model?

I'm doing video captioning on the MSR-VTT dataset.

In this dataset, I've got 10,000 videos and, for each video, I've got 20 different captions.

My model is a seq2seq RNN. The encoder's inputs are the video features, the decoder's inputs are the embedded target captions, and the decoder's outputs are the predicted captions.
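Since the question doesn't say which framework is used, here is a minimal PyTorch sketch of that kind of encoder-decoder, only to make the setup concrete; the class name, layer choices, and sizes are illustrative assumptions, not the actual model:

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    # Hypothetical minimal seq2seq: GRU encoder over frame features, GRU decoder over caption tokens.
    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, caption_in):
        # video_feats: (batch, n_frames, feat_dim); caption_in: (batch, seq_len) token ids (teacher forcing)
        _, state = self.encoder(video_feats)                 # summarize the clip into a hidden state
        dec_out, _ = self.decoder(self.embed(caption_in), state)
        return self.out(dec_out)                             # (batch, seq_len, vocab_size) logits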

I'm wondering whether using the same videos several times with different captions is useful or not.


Since I couldn't find explicit information, I tried to benchmark it.

Benchmark:

Model 1: One caption for each video

I trained it on 1108 sport videos, with a batch size of 5, for 60 epochs. This configuration takes about 211 seconds per epoch.

Epoch 1/60 ; Batch loss: 5.185806 ; Batch accuracy: 14.67% ; Test accuracy: 17.64%
Epoch 2/60 ; Batch loss: 4.453338 ; Batch accuracy: 18.51% ; Test accuracy: 20.15%
Epoch 3/60 ; Batch loss: 3.992785 ; Batch accuracy: 21.82% ; Test accuracy: 54.74%
...
Epoch 10/60 ; Batch loss: 2.388662 ; Batch accuracy: 59.83% ; Test accuracy: 58.30%
...
Epoch 20/60 ; Batch loss: 1.228056 ; Batch accuracy: 69.62% ; Test accuracy: 52.13%
...
Epoch 30/60 ; Batch loss: 0.739343; Batch accuracy: 84.27% ; Test accuracy: 51.37%
...
Epoch 40/60 ; Batch loss: 0.563297 ; Batch accuracy: 85.16% ; Test accuracy: 48.61%
...
Epoch 50/60 ; Batch loss: 0.452868 ; Batch accuracy: 87.68% ; Test accuracy: 56.11%
...
Epoch 60/60 ; Batch loss: 0.372100 ; Batch accuracy: 91.29% ; Test accuracy: 57.51%

Model 2: 12 captions for each video

Then I trained on the same 1108 sport videos, with a batch size of 64. This configuration takes about 470 seconds per epoch.

Since I have 12 captions for each video, the total number of samples in my dataset is 1108 * 12.
That's why I chose this batch size (64 ≈ 12 * old_batch_size), so that the two models launch the optimizer the same number of times.
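For reference, one possible way to build such a flattened dataset (one sample per video-caption pair) is sketched below, assuming the precomputed features and tokenized captions are already in memory; all names here are hypothetical, not the asker's actual pipeline:

from torch.utils.data import Dataset

class MultiCaptionDataset(Dataset):
    # Flattens {video_id: [captions]} into one (features, caption) sample per caption.
    def __init__(self, features, captions_per_video, n_captions=12):
        self.features = features  # dict: video_id -> feature tensor
        self.samples = [(vid, cap)
                        for vid, caps in captions_per_video.items()
                        for cap in caps[:n_captions]]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        vid, cap = self.samples[idx]
        return self.features[vid], cap

# Model 1: n_captions=1,  batch_size=5  -> ~1108/5 optimizer steps per epoch
# Model 2: n_captions=12, batch_size=64 -> ~(1108*12)/64, roughly the same number of steps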

Epoch 1/60 ; Batch loss: 5.356736 ; Batch accuracy: 09.00% ; Test accuracy: 20.15%
Epoch 2/60 ; Batch loss: 4.435441 ; Batch accuracy: 14.14% ; Test accuracy: 57.79%
Epoch 3/60 ; Batch loss: 4.070400 ; Batch accuracy: 70.55% ; Test accuracy: 62.52%
...
Epoch 10/60 ; Batch loss: 2.998837 ; Batch accuracy: 74.25% ; Test accuracy: 68.07%
...
Epoch 20/60 ; Batch loss: 2.253024 ; Batch accuracy: 78.94% ; Test accuracy: 65.48%
...
Epoch 30/60 ; Batch loss: 1.805156 ; Batch accuracy: 79.78% ; Test accuracy: 62.09%
...
Epoch 40/60 ; Batch loss: 1.449406 ; Batch accuracy: 82.08% ; Test accuracy: 61.10%
...
Epoch 50/60 ; Batch loss: 1.180308 ; Batch accuracy: 86.08% ; Test accuracy: 65.35%
...
Epoch 60/60 ; Batch loss: 0.989979 ; Batch accuracy: 88.45% ; Test accuracy: 63.45%

Here is an intuitive representation of my datasets:

[Figure: dataset layouts for Model 1 and Model 2]


How can I interpret these results?

When I manually looked at the test predictions, Model 2's predictions looked more accurate than Model 1's.

In addition, I used a batch size of 64 for Model 2, which means I could probably obtain even better results by choosing a smaller batch size. It seems I can't improve the training setup for Model 1, since its batch size is already very low.

On the other hand, Model 1 has better loss and training accuracy results...

What should I conclude?
Does Model 2 constantly overwrite the previously trained captions with the new ones instead of adding new possible captions?

Not sure if I understand this correctly, since I have only worked with neural networks like YOLO, but here is what I understand: you are training a network to caption videos, and now you want to train several captions per video, right? I guess the problem is that you are overwriting your previously trained captions with the new ones instead of adding new possible captions.

You need to train all possible captions from the start; I'm not sure whether your network architecture supports this, though. Getting this to work properly is a bit complex, because you would need to compare your output to all possible captions. Also, you probably need to use the 20 most likely captions as output, instead of just one, to get the best possible result. I'm afraid I can't do more than offer this thought, because I wasn't able to find a good source.
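One hedged way to approximate "comparing the output to all possible captions" is a min-over-references loss: compute the usual token-level cross-entropy against each reference caption of a video and back-propagate only the smallest one, so that no single reference overwrites the others. This is only a sketch of that idea, not something the answer prescribes; the tensor layout, padding index, and function name are assumptions:

import torch
import torch.nn.functional as F

def min_reference_loss(model, video_feats, ref_in, ref_target, pad_idx=0):
    # ref_in / ref_target: (batch, n_refs, seq_len) shifted caption token ids for each reference.
    batch, n_refs, seq_len = ref_target.shape
    per_ref = []
    for r in range(n_refs):
        logits = model(video_feats, ref_in[:, r])                       # (batch, seq_len, vocab)
        ce = F.cross_entropy(logits.transpose(1, 2), ref_target[:, r],
                             ignore_index=pad_idx, reduction="none")    # (batch, seq_len)
        per_ref.append(ce.mean(dim=1))                                  # per-video loss for this reference
    return torch.stack(per_ref, dim=1).min(dim=1).values.mean()         # keep only the closest reference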
