How are the FCN heads convolved over RetinaNet's FPN features?

I've recently read the RetinaNet paper, and there is one minor detail I have yet to understand:
We have the multi-scale feature maps obtained from the FPN (P3, ..., P7).
Then the two FCN heads (the classifier head and the regressor head) convolve each one of the feature maps.
However, each feature map has a different spatial scale, so how do the classifier head and regressor head maintain fixed output volumes, given that all their convolution parameters are fixed (i.e. 3x3 filters with stride 1, etc.)?

Looking at this line in PyTorch's implementation of RetinaNet, I see that the heads just convolve each feature map, and then all the outputs are stacked somehow (the only dimension they share is the channel dimension, which is 256; spatially, each level is half the size of the previous one; see the sketch below).
I'd love to hear how they are combined; I wasn't able to figure out that point.
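A toy sketch of the mismatch I mean (the single conv layer, A = 9 anchors per location, K = 80 classes, and the 512x512 input are illustrative assumptions, not RetinaNet's exact head):

```python
import torch
import torch.nn as nn

# One shared head applied to every pyramid level:
# 256 FPN channels in, A * K channels out (A = 9, K = 80 assumed).
head = nn.Conv2d(256, 9 * 80, kernel_size=3, stride=1, padding=1)

# Fake P3..P7 feature maps for a 512x512 input; spatial size halves per level.
for stride in (8, 16, 32, 64, 128):
    p = torch.randn(1, 256, 512 // stride, 512 // stride)
    print(head(p).shape)

# torch.Size([1, 720, 64, 64])
# torch.Size([1, 720, 32, 32])
# torch.Size([1, 720, 16, 16])
# torch.Size([1, 720, 8, 8])
# torch.Size([1, 720, 4, 4])
```

The channel dimension (720) is fixed, but the spatial dimensions differ at every level, and that is exactly the part I can't see how to combine.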

After the convolution at each pyramid level, you reshape the outputs to shape (H*W, out_dim) (with out_dim being num_classes * num_anchors for the class head and 4 * num_anchors for the bbox regressor). Finally, you can concatenate the resulting tensors along the H*W dimension, which is now possible because all the other dimensions match, and compute losses as you would on a network with a single feature layer.
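To make that concrete, here is a minimal sketch of the reshape-and-concatenate step (not torchvision's actual code; the two-layer head, num_anchors = 9, num_classes = 80, and the 512x512 input sizes are assumptions for illustration, and it flattens to (N, H*W*num_anchors, num_classes), i.e. the same idea with the anchor dimension folded into the concatenation axis):

```python
import torch
import torch.nn as nn

num_anchors = 9    # anchors per spatial location (assumed: 3 scales x 3 ratios)
num_classes = 80   # assumed class count
channels = 256     # FPN output channels

# A small classification head; the same weights are applied to every level.
cls_head = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
    nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1),
)

# Fake P3..P7 feature maps for a 512x512 input.
features = [torch.randn(1, channels, s, s) for s in (64, 32, 16, 8, 4)]

logits = []
for f in features:
    out = cls_head(f)                  # (N, A*K, H, W), with H and W varying
    n, _, h, w = out.shape
    out = out.permute(0, 2, 3, 1)      # (N, H, W, A*K)
    out = out.reshape(n, h * w * num_anchors, num_classes)
    logits.append(out)

# All dimensions except dim 1 now match, so the levels can be concatenated
# and treated as one big list of per-anchor predictions.
logits = torch.cat(logits, dim=1)
print(logits.shape)  # torch.Size([1, 49104, 80]); 49104 = (64²+32²+16²+8²+4²) * 9
```

The bbox-regression head works the same way, with 4 * num_anchors output channels reshaped to (N, H*W*num_anchors, 4). Since both heads are fully convolutional, the per-level spatial sizes never need to match; only the flattened per-anchor dimensions do.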
