
Faster-RCNN, why don't we just use only RPN for detection?

Original post: 2017-02-01 09:32:41 | Tags: machine-learning, computer-vision, object-detection

As we know, Faster-RCNN has two main parts: one is the region proposal network (RPN), and the other is Fast-RCNN.

My question is: now that the region proposal network (RPN) can output class scores and bounding boxes and is trainable, why do we need Fast-RCNN?

Am I right in thinking that the RPN is enough for detection (red circle), and that Fast-RCNN has become redundant (blue circle)?

5 answers

Short answer: no, they are not redundant. The R-CNN article and its variants popularized the use of what we used to call a cascade. Back then, it was fairly common in detection to chain different detectors, often very similar in structure, because of their complementary power.

If the detections are partly orthogonal, chaining them makes it possible to remove false positives along the way.

Furthermore, by definition the two parts of R-CNN have different roles: the first discriminates objects from the background, and the second discriminates fine-grained object categories from one another (and from the background as well).
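To make the cascade idea concrete, here is a minimal toy sketch in Python. The `Proposal` class, the thresholds, and all scores are hypothetical values chosen for illustration; in a real Faster-RCNN the second stage computes class probabilities only for proposals that survive the first stage, whereas here they are precomputed for brevity.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    box: tuple          # (x1, y1, x2, y2) in pixels
    objectness: float   # stage 1 score: object vs. background
    class_probs: dict   # stage 2 scores, including a "background" entry

def two_stage_detect(proposals, objectness_thresh=0.5, class_thresh=0.5):
    """Toy cascade: stage 1 filters on objectness, stage 2 classifies or rejects."""
    detections = []
    for p in proposals:
        # Stage 1 (RPN-like): drop boxes that look like background.
        if p.objectness < objectness_thresh:
            continue
        # Stage 2 (Fast-RCNN-like): pick the best class, background included.
        label, score = max(p.class_probs.items(), key=lambda kv: kv[1])
        if label == "background" or score < class_thresh:
            continue  # a false positive that stage 1 let through
        detections.append((p.box, label, score))
    return detections

# Hypothetical proposals: one true object, one stage-1 false positive,
# and one box that stage 1 already discards.
proposals = [
    Proposal((10, 10, 50, 90), 0.9, {"background": 0.05, "person": 0.90, "car": 0.05}),
    Proposal((60, 20, 120, 70), 0.8, {"background": 0.85, "person": 0.10, "car": 0.05}),
    Proposal((0, 0, 20, 20), 0.1, {"background": 0.20, "person": 0.10, "car": 0.70}),
]
detections = two_stage_detect(proposals)
print(detections)
```

Only the first proposal survives both stages; the second is removed by stage 2 even though stage 1 scored it highly, which is the complementary behaviour described above.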

But you are right: if there is only one class versus the background, one could use only the RPN part for detection. Even in that case, chaining two different classifiers would probably improve the result (or not; see e.g. this article).

PS: I answered because I wanted to, but this question is definitely unsuited for Stack Overflow.

Faster-RCNN is a two-stage method, in contrast to one-stage methods like YOLO and SSD. Faster-RCNN is accurate because of its two-stage architecture: the RPN is the first stage, which generates proposals, and the second stage (classification and localisation) learns more precise results from the coarse-grained output of the RPN.

So yes, you can, but the performance will not be good enough.

If you just add a class head to the RPN, you would indeed get detections, with scores and class estimates.
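As a rough illustration of what "adding a class head" changes, the sketch below counts the outputs per feature-map position for a binary objectness head versus a multi-class head. The function name is hypothetical, and the values of 9 anchors per position and 20 classes are assumptions chosen for the example.

```python
def head_outputs_per_position(num_anchors, num_classes=None):
    """Return (score_outputs, box_outputs) for one feature-map position.

    num_classes=None models a binary objectness head (object vs. not object);
    otherwise the head scores num_classes object classes plus background.
    """
    if num_classes is None:
        scores = 2 * num_anchors
    else:
        scores = (num_classes + 1) * num_anchors
    boxes = 4 * num_anchors  # four box coordinates per anchor
    return scores, boxes

# Assumed settings: 9 anchors per position, 20 object classes.
rpn_head = head_outputs_per_position(9)
class_head = head_outputs_per_position(9, num_classes=20)
print(rpn_head, class_head)
```

The box outputs stay the same size; only the scoring part grows, which is why this change alone does not make the boxes more accurate.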

However, the second stage is used mainly to obtain more accurate detection boxes.

Faster-RCNN is a two-stage detector, like Fast R-CNN. There, Selective Search was used to generate rough estimates of object locations, and the second stage then refines or rejects them.

Now why is this necessary for the RPN? Why are its proposals only rough estimates?

One reason is the limited receptive field: the input image is transformed by a CNN into a feature map with limited spatial resolution. For each position on the feature map, the RPN heads estimate whether the features at that position correspond to an object, and they regress the detection box. The box regression is based on the final feature map of the CNN. In particular, the correct bounding box in the image may be larger than the receptive field of the CNN at that position.
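The receptive-field argument can be checked with a small calculation. The sketch below computes the theoretical receptive field of a stack of convolution and pooling layers; the VGG-16-like layer list (3x3 convolutions with stride 1, 2x2 pooling with stride 2) and the extra 3x3 RPN convolution are assumptions for illustration, since the answer does not name a backbone.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first.

    Returns (receptive_field_size, total_stride), both in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current jump
        jump *= stride             # spacing between neighbouring outputs, in input pixels
    return rf, jump

CONV, POOL = (3, 1), (2, 2)
# Assumed VGG-16-like backbone up to the last conv layer (four pooling layers).
VGG16_LIKE = ([CONV] * 2 + [POOL] + [CONV] * 2 + [POOL] +
              [CONV] * 3 + [POOL] + [CONV] * 3 + [POOL] + [CONV] * 3)

backbone_rf, backbone_stride = receptive_field(VGG16_LIKE)
rpn_rf, _ = receptive_field(VGG16_LIKE + [CONV])  # plus an assumed 3x3 RPN conv
print(backbone_rf, backbone_stride, rpn_rf)
```

Under these assumptions, each RPN position sees a window of a couple of hundred pixels, so an object that is, say, 512 pixels wide cannot be fully visible to the regressor at any single position.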

Example: Let's say we have an image depicting a person, and the features at one position of the feature map indicate a high probability of a person. Now, if the corresponding receptive field contains only part of the body, the regressor has to estimate a box enclosing the entire person, although it "sees" only that body part.

Therefore, the RPN creates only a rough estimate of the bounding box. The second stage of Faster-RCNN uses all features contained in the predicted bounding box and can correct the estimate.
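One way the second stage can "use all features contained in the predicted bounding box" is RoI pooling, which turns a box of any size into a fixed-size grid. The sketch below is a simplified single-channel version in plain Python; the function name, the toy feature map, and the box coordinates (given in feature-map cells) are hypothetical.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the cells inside roi into an out_size x out_size grid.

    feature_map: 2D list of numbers (a single channel).
    roi: (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Split the rows of the box into out_size roughly equal bins.
        r0 = y1 + (i * h) // out_size
        r1 = max(y1 + ((i + 1) * h) // out_size, r0 + 1)
        row = []
        for j in range(out_size):
            c0 = x1 + (j * w) // out_size
            c1 = max(x1 + ((j + 1) * w) // out_size, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 4x6 feature map holding the values 1..24 in row-major order.
fm = [[r * 6 + c + 1 for c in range(6)] for r in range(4)]

tall_box = roi_max_pool(fm, (1, 0, 5, 4))  # 4 cells wide, 4 cells high
wide_box = roi_max_pool(fm, (0, 0, 6, 2))  # 6 cells wide, 2 cells high
print(tall_box, wide_box)
```

Both boxes produce a 2x2 output even though their shapes differ, so the second-stage classifier and box regressor always receive an input of the same size that summarises the whole proposal.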

In the example, the RPN creates a bounding box that is too large but encloses the person (since it cannot see the person's pose), and the second stage uses all the information in this box to reshape it so that it is tight. This can be done much more accurately, since more of the object's content is accessible to the network.
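The reshaping step can be sketched as applying predicted offsets to the loose box. The code below assumes the common R-CNN-style parameterisation (centre shifts relative to box size, log-scale width and height changes); the function name and the example deltas are hypothetical, chosen so that the box shrinks around its centre.

```python
import math

def apply_deltas(box, deltas):
    """Refine box (x1, y1, x2, y2) with deltas (dx, dy, dw, dh).

    dx, dy shift the centre as a fraction of the box width and height;
    dw, dh rescale the width and height on a log scale.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

# A loose proposal and hypothetical second-stage deltas that halve the width
# and keep three quarters of the height.
loose = (0.0, 0.0, 200.0, 400.0)
tight = apply_deltas(loose, (0.0, 0.0, math.log(0.5), math.log(0.75)))
print(tight)
```

With these made-up deltas the refined box comes out at approximately (50, 50, 150, 350). In a trained detector the deltas are predicted by the second-stage regressor from the pooled features of the proposal.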

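I think the blue circle is completely redundant. Just adding a classification layer (which assigns a class to each bounding box that contains an object) should work fine, and that is what single-shot detectors do, at the cost of some accuracy.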

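As I understand it, the RPN only performs a binary check of whether a bbox contains an object, and the final detector part classifies the object, e.g. car, person, phone, etc.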