What are the deep learning tricks for reaching SOTA?

OpenMMLab
Sep 9, 2022 · 3 min read


For image classification tasks, taking the tricks used in Swin Transformer as an example, let’s briefly go over some of the tricks commonly used in deep learning today.

1. Stochastic Depth

This approach was first proposed in the paper Deep Networks with Stochastic Depth, where it was called stochastic depth. Google’s EfficientNet implementation called it drop connect; because that name clashes with the existing DropConnect technique, timm’s implementation renamed it drop path (which, awkwardly, clashes with yet another DropPath). So when you hear these terms, pay attention to which technique is actually meant.

Stochastic depth is similar to dropout, but not the same. Simply put, dropout randomly zeroes out individual activation values during training, while stochastic depth discards the outputs of entire samples, i.e., sets the values of those samples to zero.

Therefore, this method can only be used inside a residual structure: the residual branch’s output is discarded for some samples before being added to the shortcut, so those samples effectively “skip” the residual branch.

By randomly skipping parts of the residual structure, training implicitly combines networks of many different depths, similar to ensemble learning, which improves the network’s performance.
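A minimal sketch of how drop path / stochastic depth is typically implemented (modeled on the idea described above, not copied from timm or MMClassification):

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Zero out the residual branch for a random subset of samples.

    Surviving samples are rescaled by 1 / keep_prob so the expected
    output stays the same between training and inference.
    """

    def __init__(self, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = torch.rand(mask_shape, device=x.device, dtype=x.dtype) < keep_prob
        return x * mask / keep_prob
```

Inside a residual block this is used as `x = shortcut + drop_path(residual_branch(x))`, so a “dropped” sample simply passes through the shortcut untouched.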

2. Mixup & CutMix

Both are image mixing augmentations: during training, we mix two samples in a certain way and mix their labels accordingly. The difference between Mixup and CutMix lies in how the images are mixed.

Image mixing augmentation aims to smooth the low-dimensional manifold that the images are embedded into after being mapped by the neural network, thus improving the network’s generalization ability. For a detailed description of image mixing augmentation, see https://mmclassification.readthedocs.io/en/master/api/models.utils.augment.html
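As a rough sketch of Mixup (not MMClassification’s actual implementation; it assumes one-hot labels so they can be blended with the same coefficient as the images):

```python
import numpy as np
import torch

def mixup(images, labels, alpha=0.2):
    """Blend a batch with a shuffled copy of itself.

    `images`: float tensor of shape (N, C, H, W).
    `labels`: one-hot float tensor of shape (N, num_classes).
    """
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    index = torch.randperm(images.size(0))      # random pairing of samples
    mixed_images = lam * images + (1 - lam) * images[index]
    mixed_labels = lam * labels + (1 - lam) * labels[index]
    return mixed_images, mixed_labels
```

CutMix follows the same pattern, except that instead of blending pixel values it cuts a rectangular patch from one image and pastes it onto the other, mixing the labels in proportion to the patch area.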

3. RandAugment

This is a combined data augmentation method. Compared with traditional augmentations such as random cropping and flipping, it sets up a pool of many augmentation transforms and randomly applies several of them to each sample, greatly expanding the space of augmented images.

For more information about RandAugment, see https://mmclassification.readthedocs.io/en/master/api/transforms.html#composed-augmentation
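The core loop can be sketched in a few lines; `transform_pool`, `num_ops`, and `magnitude` below are illustrative names, not MMClassification’s API:

```python
import random

def rand_augment(image, transform_pool, num_ops=2, magnitude=9):
    """Apply `num_ops` transforms picked at random from `transform_pool`.

    Each entry of `transform_pool` is assumed to be a callable taking
    (image, magnitude), e.g. rotate, shear, posterize, color jitter, ...
    All picked transforms share the same magnitude level.
    """
    for op in random.choices(transform_pool, k=num_ops):
        image = op(image, magnitude)
    return image
```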

4. RandomErasing

This method comes from the paper Random Erasing Data Augmentation. Its core idea is simple: randomly select a region in the image and fill it in. This simulates real scenes where the target to be recognized may be partially occluded by another object, thus improving the model’s generalization ability.
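A minimal sketch of the idea on a CHW image tensor; the area and aspect-ratio ranges below are illustrative defaults, not necessarily the ones used in the Swin config:

```python
import random
import torch

def random_erasing(img, p=0.5, area_range=(0.02, 0.33), aspect_range=(0.3, 3.3)):
    """With probability `p`, overwrite a random rectangle of `img` with noise.

    `img` is a float tensor of shape (C, H, W); the erased area is a random
    fraction of the image given by `area_range`.
    """
    if random.random() > p:
        return img
    _, h, w = img.shape
    area = random.uniform(*area_range) * h * w
    aspect = random.uniform(*aspect_range)
    erase_h = int(round((area * aspect) ** 0.5))
    erase_w = int(round((area / aspect) ** 0.5))
    if 0 < erase_h < h and 0 < erase_w < w:
        top = random.randint(0, h - erase_h)
        left = random.randint(0, w - erase_w)
        img[:, top:top + erase_h, left:left + erase_w] = torch.randn(
            img.shape[0], erase_h, erase_w)
    return img
```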

5. CosineAnnealingLR

Cosine learning rate decay is currently the dominant learning rate decay method in image classification tasks. We all know that decaying the learning rate lets the network search for a good solution with a higher learning rate in the early stage and then converge to it with a lower learning rate later. Although modern optimizers such as Adam adapt their step sizes per parameter, it is often still necessary to limit the optimizer’s steps by decaying the learning rate. Cosine learning rate decay provides a smooth learning rate decay curve with the following equation:
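In its standard form (as implemented, for example, by PyTorch’s CosineAnnealingLR), with $\eta_{\max}$ the initial learning rate, $\eta_{\min}$ the minimum learning rate, $t$ the current step, and $T_{\max}$ the total number of decay steps:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)$$

A minimal PyTorch sketch of the schedule (placeholder model and hyperparameters, not the exact Swin recipe):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=300, eta_min=1e-5)

for epoch in range(300):
    # ... one training epoch would go here ...
    scheduler.step()  # learning rate follows the cosine curve above
```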

6. Weight decay

Weight decay is a regularization method that limits the range of the network’s parameters by adding the L2 norm of the parameters to the loss.

If individual parameters grow too large, the network may come to rely only on them, effectively “narrowing” the network and hurting its generalization ability.
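A minimal sketch of classic L2 weight decay added directly to the loss (in practice you usually just pass a `weight_decay` argument to the optimizer: SGD’s roughly corresponds to this L2 term, while AdamW applies the decay in decoupled form):

```python
import torch

def loss_with_weight_decay(model, task_loss, weight_decay=1e-4):
    """Add the squared L2 norm of all parameters to the task loss."""
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return task_loss + weight_decay * l2
```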

All of the above tricks are used in the Swin Transformer configuration files of MMClassification.

Feel free to refer to https://github.com/open-mmlab/mmclassification/tree/master/configs/swin_transformer to use MMClassification and these tricks to improve the network performance.
