Paper Review: DAMO-YOLO: A new object detection framework that balances speed and accuracy

KevinLuo
11 min read · Jan 17, 2023


The team recently open-sourced DAMO-YOLO! It reaches state-of-the-art results among YOLO-series detectors, and I will walk through the details here.

1. Introduction

DAMO-YOLO is an object detection framework that balances speed and accuracy: it surpasses current YOLO-series methods, achieving SOTA while maintaining high inference speed. DAMO-YOLO builds on the YOLO framework but significantly modifies the whole detection pipeline with a series of new techniques: a detection backbone found by NAS search, a deeper neck structure, a streamlined head structure, and knowledge distillation for a further accuracy boost. Beyond the model itself, DAMO-YOLO also provides efficient training strategies and convenient, easy-to-use deployment tools to help you quickly solve practical problems in industrial applications!

Ref: DAMO-YOLO high-performance general detection model (S) · Model Hub (modelscope.cn)

Ref: Code address: GitHub — tinyvision/damo-yolo

Fig. 1: Average performance is better than YOLOv7 and other YOLO-series detectors

2. Key technologies

2.1. NAS backbone: MAE-NAS

The backbone's network structure plays an important role in object detection. DarkNet was dominant in the early YOLO series, and recent work such as YOLOv6 and YOLOv7 has begun to explore other network structures that are effective for detection. However, these networks are still designed by hand. With the development of neural architecture search (NAS), many NAS-generated structures are now available for detection tasks, and compared with traditional manually designed networks they can achieve better detection results. The authors therefore use NAS to search for a suitable backbone for DAMO-YOLO, relying on Alibaba's self-developed MAE-NAS (https://github.com/alibaba/lightweight-neural-architecture-search). MAE-NAS is a heuristic, training-free NAS method that can quickly search backbone structures of different sizes over a wide design space.

MAE-NAS uses information theory to evaluate randomly initialized networks from the perspective of entropy; the evaluation requires no training at all, which removes the main drawback of earlier NAS methods that must train each candidate before evaluating it. This enables a broad architecture search in a short time, reduces search cost, and increases the chance of finding potentially better structures. Notably, the MAE-NAS search uses K1K3 as the basic search block, and it targets GPU inference latency directly, rather than FLOPs, as the budget. After the search, spatial pyramid pooling and Focus modules are applied to the final backbone. Table 1 below compares the performance of different backbones: the MAE-NAS backbone is significantly better than the DarkNet structure.
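The training-free entropy idea can be illustrated numerically. The sketch below is a hypothetical toy, not the MAE-NAS implementation: it scores a randomly initialized MLP "backbone" by summing a Gaussian differential-entropy proxy of each layer's activations, so a candidate can be ranked without any training step.

```python
import numpy as np

def entropy_score(widths, n_samples=64, in_dim=32, seed=0):
    # Training-free proxy in the spirit of MAE-NAS: push random inputs
    # through a randomly initialised MLP and sum, over layers, the
    # Gaussian differential-entropy proxy 0.5*log(2*pi*e*var) of the
    # activations. Illustrative sketch only, not the official criterion.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, in_dim))
    score, dim = 0.0, in_dim
    for w in widths:
        W = rng.standard_normal((dim, w)) / np.sqrt(dim)  # He-style scaling
        x = np.maximum(x @ W, 0.0)                        # ReLU activation
        score += 0.5 * np.log(2 * np.pi * np.e * x.var() + 1e-12)
        dim = w
    return float(score)
```

Because no gradients or labels are involved, thousands of candidate width configurations can be scored this way in seconds, which is what makes the wide search space affordable.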

2.2. Large Neck: RepGFPN

In an FPN (Feature Pyramid Network), multi-scale feature fusion aggregates the features output by different stages of the backbone, enhancing the expressive power of the output features and improving model performance. The traditional FPN introduces a top-down path to merge multi-scale features. Given the limitation of this one-way information flow, PAFPN adds an extra bottom-up path aggregation network, at the cost of more computation. To reduce that cost, YOLO-series detectors adopt PAFPN with CSPNet to fuse the multi-scale features from the backbone.

Their ICLR 2022 work GiraffeDet proposes a novel light-backbone, heavy-neck structure and achieves SOTA performance, because its neck, GFPN (Generalized FPN), can fully exchange high-level semantic information and low-level spatial information. In GFPN, multi-scale fusion happens between features of different scales from the previous layer and the current layer; in addition, log_2(n) cross-layer connections provide more efficient information transmission and allow scaling to deeper networks.

Fig. 2: Fusion block

They therefore tried introducing GFPN into DAMO-YOLO and, as expected, achieved higher accuracy than with PANet. At the same time, however, GFPN increases model inference latency, so the accuracy/latency trade-off gains little. Analyzing the original GFPN structure, they attribute this to the following: (1) features of all scales share the same number of channels, which makes it hard to pick one channel count that gives both high-level low-resolution features and low-level high-resolution features equally rich expressive power; (2) GFPN uses Queen-Fusion to strengthen feature fusion, and Queen-Fusion contains a large number of upsampling and downsampling operators to fuse features across scales, which greatly hurts inference speed; (3) the 3x3 convolutions used for cross-scale fusion in GFPN are not efficient enough for lightweight computation and need further optimization.

Based on this analysis, they propose a new Efficient-RepGFPN on top of GFPN, designed for real-time detection necks, with the following main improvements: (1) different channel counts are used for features at different scales, to flexibly control the expressive power of high-level and low-level features under lightweight-computation constraints; (2) the extra upsampling operations in Queen-Fusion are removed, greatly reducing inference latency at a small accuracy cost; (3) the original convolution-based feature fusion is replaced by a CSPNet connection, with re-parameterization and ELAN-style connections introduced to improve accuracy without adding computation. The final Efficient-RepGFPN structure is shown in Figure 2 above, and its ablation study in Table 2 below.
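A shape-level sketch helps make the CSP-style fusion concrete. The code below is an illustration in the spirit of Efficient-RepGFPN, not the DAMO-YOLO implementation: it downsamples the higher-resolution feature (no extra upsampling path), concatenates along channels, and then splits into a processed branch and an identity branch before re-merging, with random weights standing in for learned convolutions.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution as a channel-mixing matmul; x: (C, H, W), w: (C_out, C_in)
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def csp_fuse(hi, lo, c_out, seed=0):
    # Hypothetical CSP-style fusion step: hi is a higher-resolution
    # feature, lo a lower-resolution one; note the two inputs may have
    # different channel counts, as in Efficient-RepGFPN.
    rng = np.random.default_rng(seed)
    hi_ds = hi[:, ::2, ::2]                       # stride-2 downsample (no upsampling path)
    x = np.concatenate([hi_ds, lo], axis=0)       # channel concat
    w_in = rng.standard_normal((c_out, x.shape[0]))
    x = np.maximum(conv1x1(x, w_in), 0.0)         # 1x1 transition + ReLU
    a, b = np.split(x, 2, axis=0)                 # CSP split into two halves
    w_branch = rng.standard_normal((a.shape[0], a.shape[0]))
    a = np.maximum(conv1x1(a, w_branch), 0.0)     # only one half is processed
    return np.concatenate([a, b], axis=0)         # re-merge the halves
```

The CSP split is the key cost saver: only half the channels pass through the heavy branch, while the identity half preserves gradient flow, which is why Table 4's CSPNet fusion beats plain convolution-based fusion at the same budget.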

As Table 2 shows, flexibly controlling the channel count of each scale's feature map achieves higher accuracy than sharing the same channel count across all scales, confirming that separately tuning the expressive power of high-level and low-level features brings gains. Holding the computation budget fixed, they also compared depth/width trade-offs within Efficient-RepGFPN; the model achieves its highest accuracy at depth=3, width=(96, 192, 384).

Table 3 presents the ablation of Queen-Fusion connections; the baseline neck is a PANet connection without the extra upsampling and downsampling operators. Adding only the upsampling operator, only the downsampling operator, or the complete Queen-Fusion structure each improves accuracy. However, adding only the upsampling operator costs 0.6 ms of inference time for just a 0.3 accuracy gain, far below the accuracy/latency benefit of adding only the extra downsampling operators, so the extra upsampling operators are dropped in the final design.

Table 4 compares multi-scale feature fusion methods. Under a low computation budget, CSPNet-style fusion is much better than convolution-based fusion, and introducing re-parameterization and ELAN connections brings a large accuracy improvement at only a small latency cost.

2.3. Small Head: ZeroHead

This section focuses on DAMO-YOLO's detection head, ZeroHead. Among current object detection methods, a decoupled head is the common choice: it achieves higher AP but increases the model's computation time to some extent. To balance speed and performance, they ran the experiments in Table 5 below to choose an appropriate split of capacity between neck and head.

Table 5 The influence of different proportions of Neck and Head on training results

From Tables 2, 3, and 4 it can be seen that a "big neck, small head" structure gives better performance. They therefore discard the decoupled head used in previous methods and retain only one linear projection layer each for the classification and regression tasks, which they call ZeroHead. ZeroHead minimizes the computation spent in the detection head, freeing up budget for a more complex neck such as RepGFPN. Note that ZeroHead can essentially be regarded as a coupled head, which is a significant difference from the decoupled heads adopted by previous methods.
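The idea can be sketched in a few lines. The code below is a hypothetical illustration of a ZeroHead-style head, not the official one: for each FPN level, a single linear projection (a 1x1 convolution, here a channel matmul with random weights) maps the neck features directly to classification logits and 4 box-regression values, with no extra convolution stack.

```python
import numpy as np

def zero_head(feats, num_classes=80, seed=0):
    # Minimal ZeroHead-style sketch: one linear projection per FPN level,
    # producing num_classes classification logits and 4 regression values
    # per spatial location. Weights are random; illustrative only.
    rng = np.random.default_rng(seed)
    outs = []
    for f in feats:                               # f: (C, H, W)
        c, h, w = f.shape
        proj = rng.standard_normal((num_classes + 4, c)) / np.sqrt(c)
        y = (proj @ f.reshape(c, -1)).reshape(num_classes + 4, h, w)
        outs.append((y[:num_classes], y[num_classes:]))  # (cls, reg) maps
    return outs
```

Because the head is a single shared projection, almost the entire compute budget at fixed latency can be reallocated to the neck, which is exactly the "big neck, small head" trade-off in Table 5.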

2.4. Label Assignment: AlignOTA

Label assignment is a key component of object detection. Previous static assignment methods often consider only the IoU between anchors and ground truth. This kind of assignment easily makes the classification task lose focus: as shown on the left of Figure 3, the hand's ground-truth box is assigned a prediction point that lies on the teddy bear, which is unreasonable for the model; the ideal assignment is shown on the right of Figure 3. Moreover, such methods depend on anchor priors, and in industrial applications, where object scales vary widely, finding the most suitable anchor prior is very cumbersome.

Figure 3: Difference between static and dynamic assignment

To overcome these problems, a number of label assignment methods that use the model's classification and regression predictions have emerged in academia. They remove the dependence on anchors and consider classification and regression jointly during assignment, alleviating the loss-of-focus problem to some extent. OTA is one of the classic works: it computes an assignment cost from the model's classification and regression predictions and uses the Sinkhorn-Knopp algorithm to solve for the globally optimal assignment, performing excellently in complex assignment scenarios. Here they adopt simOTA, an accelerated version of OTA, as their assignment strategy. However, simOTA itself has a problem: when computing the assignment, it does not guarantee a balanced consideration of classification and regression. In other words, the classification and regression losses are misaligned. To address this, they modify how the assignment score is computed, as shown in the following formula.

AlignOTA assignment cost calculation

To balance classification and regression in label assignment, they introduce Focal Loss into the assignment's classification cost and replace the one-hot classification label with the IoU, relaxing the classification constraint. Table 4 compares the improved AlignOTA with simOTA; AlignOTA brings a significant performance improvement.
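To make the intuition concrete, here is a hedged sketch of what such an assignment cost can look like: a focal-style cross-entropy whose soft label is the IoU, plus -log(IoU) as the regression cost. The exact weighting in the paper may differ; this is an illustration of the alignment idea, not the official AlignOTA formula.

```python
import numpy as np

def align_ota_cost(score, iou, gamma=2.0, eps=1e-9):
    # Illustrative assignment cost in the spirit of AlignOTA:
    # - classification: cross-entropy against the IoU as a soft label,
    #   modulated focal-style by how far the score is from the IoU;
    # - regression: -log(IoU).
    # Hypothetical sketch, not the formula from the paper.
    score = np.clip(score, eps, 1 - eps)
    iou = np.clip(iou, eps, 1 - eps)
    ce = -(iou * np.log(score) + (1 - iou) * np.log(1 - score))
    c_cls = np.abs(iou - score) ** gamma * ce     # focal-style modulation
    c_reg = -np.log(iou)                          # regression cost
    return c_cls + c_reg
```

The effect is that an anchor whose classification score agrees with its localization quality is cheap to assign, while a confident-but-badly-localized (or well-localized-but-unconfident) anchor is penalized, which is the misalignment simOTA does not account for.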

2.5. Distillation Enhancement

Model distillation is an effective means of improving model performance. YOLOv6 tries self-distillation on its large models, but overall, distillation is still uncommon in current YOLO-series work, especially on small models. The authors studied this specifically for DAMO-YOLO and ultimately applied distillation to improve results across all of its model scales.

DAMO-YOLO's training is divided into two stages: the first trains with strong mosaic augmentation, and the second trains with mosaic augmentation turned off. They found that using distillation in the first stage gives faster convergence and higher final results, but continuing distillation in the second stage brings no further gain. They believe the second stage's data distribution deviates considerably from the first, so second-stage distillation would partly destroy the knowledge distribution learned in the first stage, while the second stage is too short for the model to fully transition between the two distributions. Forcibly lengthening the training schedule or raising the learning rate would, on the one hand, increase training cost and time, and on the other, weaken the effect of first-stage distillation. They therefore disable distillation in the second stage and distill only in the first.

Fig. 4 The relationship between classification loss and accuracy under different distillation weights

Second, they introduce two techniques into distillation. One is an alignment module, used to align the feature-map sizes of teacher and student. The other is a normalization operation, used to damp the effect of numerical-scale fluctuations between teacher and student; it can be seen as a dynamic temperature coefficient for the KL loss.

They also found that the distillation loss weight and the head size greatly influence the distillation effect. As Figure 4 above shows, when the distillation loss weight grows, classification loss converges more slowly and fluctuates strongly. Classification loss has a very large impact on detection: if it converges late, the model is under-optimized, hurting the final detection result. Therefore, unlike previous distillation practice, DAMO-YOLO uses a smaller distillation weight to control the distillation loss, weakening the conflict between the distillation and classification losses.

Meanwhile, the detection head is ZeroHead, which contains only one linear layer for task projection. The distillation loss and the classification loss therefore optimize the same feature space, so the learned space must satisfy both objectives at once, further improving the consistency between classification-loss and distillation-loss optimization.

Performance of different distillation methods

3. Performance comparison

The DAMO-YOLO team verified performance on the MS COCO val set. Combined with the improvements above, DAMO-YOLO achieves a significant accuracy improvement under strict latency constraints, setting a new SOTA.

Table 8 SOTA comparison

4. Finally

DAMO-YOLO is a young and rapidly developing detection framework; if you have any questions or suggestions, feel free to comment below.

Thanks for reading!
