Channel Attention and Squeeze-and-Excitation Networks (SENet)

Sep 25, 2024

During the early days of attention mechanisms in computer vision, a paper published at CVPR 2018 (and later in TPAMI), Squeeze-and-Excitation Networks, introduced a novel channel attention mechanism. This simple yet efficient add-on module can be attached to any baseline architecture to obtain an improvement in performance, with negligible computational overhead.

In this article we’ll cover Squeeze-and-Excitation Networks in four parts. First, we will build the intuition behind why channel attention is important by looking at some aspects of modern photography. Then we’ll move on to the methodology involved in computing channel attention in Squeeze-and-Excitation (SE) blocks. Following this, we will dissect the impact of Squeeze-and-Excitation (SE) blocks in standard architectures, while evaluating them on different computer vision tasks. We will end with a critique of the paper regarding certain shortcomings in the proposed method.

Table of Contents

  • Frame Selection in Modern Photography
  • Channel Attention in Convolutional Neural Networks
  • Squeeze-and-Excitation Networks
  • Code
  • MBConv in EfficientNets
  • Benchmarks
  • Shortcomings
  • References

Prerequisites

  • Python: to run the code presented here, your machine will need Python installed. Readers should have basic Python coding experience before continuing.
  • Deep Learning basics: This article covers concepts fundamental to applying Deep Learning theory, and readers are expected to have some experience with the relevant terms and basic theory.

Frame Selection in Modern Photography

Frame shots in the Pixel 2

As modern photography has grown through generations of improvements in intelligent mechanisms to capture the best shot, one of the most subtle techniques that has gone under the radar is the selection of the best frame shots for a still photo. This is a common feature in certain smartphones.

There are many variables in a still photograph. Two photographs of a subject taken a second apart from each other, under the same conditions and in the same environment, can still differ a lot. For example, the subject's eyes might be closed in one of the two photographs. To get the best shot, it is much better to capture multiple frames at the instant the photo is taken, so that the photographer has the choice of selecting the best frame from all the frames captured. Nowadays this is done in an automated, intelligent fashion. Smartphones like the Google Pixel have the ability to pick the best frame from all the available frames taken when capturing a single photograph. This intelligent system is conditioned on factors like lighting, contrast, blur, background distortion, etc. In an abstract way, the intelligent system is selecting the frame which contains the best representative information of the photograph.

In terms of modern convolutional neural network architectures, you can think of the frames as the channels in a tensor computed by a convolutional layer. This tensor is usually denoted by a (B, C, H, W) dimensionality, where B refers to the batch size, C refers to the channels, and H, W refer to the spatial dimensions of the feature maps (H represents the height and W represents the width). The channels are the result of the convolutional filters deriving different features from the input. However, the channels might not all have the same representative importance. As some channels might be more important than others, it makes sense to apply a weight to the channels based on their importance before propagating to the next layer, as sketched in the minimal example below.
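To make this concrete, here is a minimal PyTorch sketch of what weighting channels by importance means; the tensor shape and the weight values are made up purely for illustration.

import torch

# A made-up activation tensor from some convolutional layer: (B, C, H, W)
x = torch.randn(8, 64, 32, 32)

# Hypothetical per-channel importance scores, one scalar per channel
channel_weights = torch.rand(64)

# Broadcast the weights to (1, C, 1, 1) and rescale every feature map
x_weighted = x * channel_weights.view(1, -1, 1, 1)

print(x_weighted.shape)  # torch.Size([8, 64, 32, 32]) -- same shape, rescaled channels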

We will use this as a foundational understanding of the importance of channel attention, which we will go through in the following sections.

Channel Attention

Based on the intuition described in the previous section, let’s go in depth into why channel attention is a crucial component for improving the generalization capabilities of a deep convolutional neural network architecture.

To recap, in a convolutional neural network, there are two major components:

  1. The input tensor (usually a four-dimensional tensor) represented by the dimensions (B, C, H, W).
  2. The trainable convolutional filters which incorporate the weights for that layer.

The convolutional filters are responsible for constructing the feature maps based on the learned weights within those filters. While some filters learn edges and others learn textures, collectively they learn different feature representations of the target class information within the image embedded in the input tensor. Thus, the number of channels represents the number of convolutional filters which learn the different feature maps of the input. From our earlier understanding of frame selection in photography, these feature maps also have different degrees of importance; that is, some feature maps are more important than others. For example, a feature map containing edge information might be more crucial for learning than another feature map capturing background texture transitions. Thus, at a basic level, one would want to give the “more important” feature maps a higher degree of importance than their counterparts.

Example feature maps

This is the foundation for channel attention. We want to focus this “attention” on the more important channels, which essentially means giving higher weight to specific channels over others. The simplest way to do this is by scaling the more important channels by a higher value. This is exactly what Squeeze-and-Excitation Networks propose.

Squeeze-and-Excitation Networks

In 2018, Hu et al. published their paper titled Squeeze-and-Excitation Networks at CVPR 2018, with a journal version in TPAMI. Hailed as one of the most influential works in the domain of attention mechanisms, the paper has garnered over 1,000 citations. Let’s take a look at what the paper proposes.

Squeeze-Excitation Module

The paper proposes a novel, easy-to-plug-in module called a Squeeze-and-Excite block (abbreviated as SE-block) which consists of three components (shown in the figure above):

  1. Squeeze Module
  2. Excitation Module
  3. Scale Module

Let’s go through each of these modules in more detail and understand why they’re important in the context of channel attention.

Squeeze Module

To obtain optimal channel attention, one would want the scaling of the feature maps to be adaptive to the feature maps themselves. To recap, the set of feature maps is essentially the output tensor of a convolutional layer (usually a 4-D tensor of dimensionality (B, C, H, W), where the letters stand for the batch size, channels, height and width of the feature maps). For simplicity we will only consider it as a 3-D tensor of shape (C, H, W); essentially we are concerned with the depth (number of channels/feature maps in the tensor) and the spatial dimensions of each feature map in that tensor. Thus, to make channel attention adaptive to each channel itself, we have H × W pixels (or values) per channel to be concerned with. This would mean that to make the attention truly adaptive, you’d be operating with a total of C × H × W values. This number gets very large because, in modern neural networks, the number of channels grows with the increasing depth of the network. Thus, a feature descriptor which can decompose the information of each feature map into a single value is helpful in reducing the computational complexity of the whole operation.

This forms the motivation for the Squeeze Module. There are many feature descriptors which can be used to reduce the spatial dimensions of the feature maps to a single value, but the general method used for reducing spatial size in convolutional neural networks is pooling. There are two very popular methods of pooling: average pooling and max pooling. The former computes the mean of the pixel values within a defined window, while the latter takes the maximum pixel value in the same defined window. Both have their fair share of advantages and disadvantages. While max pooling preserves the most strongly activating pixels, it can also be extremely noisy and ignores the neighboring pixels. Average pooling, on the other hand, doesn’t preserve that peak information; instead it constructs a smoother average of all the pixels in that window.

The authors did an ablation study to investigate the performance of each descriptor, namely Global Average Pooling (GAP) and Global Max Pooling (GMP), which is shown in the following table.

Descriptor Top-1 Error Rate Top-5 Error Rate
GMP 22.57 6.09
GAP 22.28 6.03

The Squeeze Module thus opts for the smoother of the two options and uses the Global Average Pooling (GAP) operation, which reduces each feature map to a single value by taking the mean of all its pixels. Thus, in terms of dimensionality, if the input tensor is (C × H × W), then after passing it through the GAP operator the output tensor will be of shape (C × 1 × 1), essentially a vector of length C where each feature map is now decomposed into a single value.
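As a quick sketch (with arbitrarily chosen shapes), the squeeze step can be expressed in PyTorch as a global average pool:

import torch
from torch import nn

x = torch.randn(1, 256, 14, 14)            # (B, C, H, W) feature maps from some layer

gap = nn.AdaptiveAvgPool2d(1)              # global average pool: one value per channel
z = gap(x)                                 # shape: (1, 256, 1, 1)

# Equivalent formulation: the mean over the spatial dimensions
z_alt = x.mean(dim=(2, 3), keepdim=True)
print(z.shape, torch.allclose(z, z_alt))   # torch.Size([1, 256, 1, 1]) True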

To verify the importance of the Squeeze operator, the authors further compared a Squeeze variant and a No-Squeeze variant, as shown in the following table. Note: the No-Squeeze variant essentially means that the tensor containing the feature maps was not reduced to a single value per channel, and the Excitation module operated on the full tensor.

Variant Top-1 Error Rate Top-5 Error Rate GFLOPs Parameters
Vanilla ResNet-50 23.30 6.55 3.86 25.6M
NoSqueeze 22.93 6.39 4.27 28.1M
SE 22.28 6.03 3.87 28.1M

Excitation Module

Example of a Multi-Layer Perceptron (MLP) structure.

Now, with the input tensor decomposed to a considerably smaller size of (C × 1 × 1), the next part of the module is to learn the adaptive scaling weights for these channels. For the Excitation Module in the Squeeze-and-Excitation block, the authors opt for a fully connected Multi-Layer Perceptron (MLP) bottleneck structure to map the scaling weights. This MLP bottleneck has a single hidden layer along with the input and output layers, which are of the same shape. The hidden layer is used as a reduction block where the input space is reduced to a smaller space defined by a reduction factor (set to 16 by default). The compressed space is then expanded back to the original dimensionality of the input tensor. In more compact terms, the changes in dimensionality at each layer of the MLP can be defined by the following three points:

  1. The input is of shape (C × 1 × 1). Thus, there are C neurons in the input layer.
  2. The hidden layer reduces this by a reduction factor r, thus leading to a total of C/r neurons.
  3. Finally, the output is projected back to the same dimensional space as the input, returning to C neurons in total.

In total, you pass the (C × 1 × 1) tensor as input and get a weighted tensor of the same shape, (C × 1 × 1).
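A minimal sketch of this bottleneck (the bias-free linear layers follow common SE implementations; the channel count is an assumption for illustration):

import torch
from torch import nn

C, r = 256, 16                             # channel count (assumed) and reduction ratio

excitation = nn.Sequential(
    nn.Linear(C, C // r, bias=False),      # compress the C-length descriptor to C/r
    nn.ReLU(inplace=True),
    nn.Linear(C // r, C, bias=False),      # project back up to C
)

z = torch.randn(1, C)                      # squeezed descriptor from GAP, flattened to (B, C)
weights = excitation(z)                    # raw per-channel weights, shape (1, C)
print(weights.shape)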

The authors provide results for experiments on the performance of an SE module in a ResNet-50 architecture using different reduction ratios (r), as shown in the table below.

r Top-1 Error Rate Top-5 Error Rate Parameters
2 22.29 6.00 45.7M
4 22.25 6.09 35.7M
8 22.26 5.99 30.7M
16 22.28 6.03 28.1M
32 22.72 6.20 26.9M
Vanilla 23.30 6.55 25.6M

Ideally, for improved information propagation and better cross-channel interaction (CCI), r should be set to 1, thus making the MLP a fully connected square network with the same width at every layer. However, there exists a trade-off between increasing complexity and performance improvement with decreasing r. Thus, based on the table above, the authors use 16 as the default value for the reduction ratio. This is a hyperparameter with scope for further tuning to improve performance.
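As a rough back-of-the-envelope check (assuming the bias-free two-layer MLP described above), each SE block adds about 2·C²/r weights, which is why the parameter count in the table grows quickly as r shrinks:

def se_extra_params(channels, r):
    """Weights added by one bias-free SE bottleneck: C*(C/r) + (C/r)*C."""
    hidden = channels // r
    return 2 * channels * hidden

# Example: a late ResNet stage with 2048 channels
for r in (2, 4, 8, 16, 32):
    print(r, se_extra_params(2048, r))     # e.g. r=16 -> 524288 (~0.5M) weights for this block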

Scale Module

After obtaining the (C × 1 × 1) “excited” tensor from the Excitation Module, it is first passed through a sigmoid activation layer which scales the values to the range 0-1. Subsequently, the output is applied directly to the input by a simple broadcasted element-wise multiplication, which scales each channel/feature map in the input tensor by its corresponding learned weight from the MLP in the Excitation Module.
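Continuing the sketch from above (same assumed shapes), the scale step is just a sigmoid followed by a broadcasted multiplication:

import torch

B, C, H, W = 1, 256, 14, 14
x = torch.randn(B, C, H, W)                # original feature maps
weights = torch.randn(B, C)                # raw excitation output from the MLP sketch above

scale = torch.sigmoid(weights)             # squash the per-channel weights into (0, 1)
y = x * scale.view(B, C, 1, 1)             # broadcast over H and W: each channel is rescaled

print(y.shape)                             # torch.Size([1, 256, 14, 14])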

The authors did further ablation studies on the effect of different non-linear activation functions used as the excitation operator, as shown in the following table.

Activation Function Top-1 Error Rate Top-5 Error Rate
ReLU 23.47 6.98
Tanh 23.00 6.38
Sigmoid 22.28 6.03

Based on the results, the authors found Sigmoid to be the best-performing activation function, and thus employ it as the default excitation function in the Scale Module.

To summarize, the Squeeze-and-Excitation block (SE block) takes an input tensor x of shape (C × H × W), reduces it to a tensor of shape (C × 1 × 1) by Global Average Pooling (GAP), subsequently passes this C-length vector through a Multi-Layer Perceptron (MLP) bottleneck structure, and outputs a weighted tensor of the same shape (C × 1 × 1), which is then broadcast and multiplied element-wise with the input x.

Now, the question is: where is this module “plugged into”, e.g. in a Residual Network?

SE block integration designs explored in the ablation study.

The authors tried out different integration strategies for the SE block, as shown in the diagram above. These include:

  1. Standard SE
  2. SE-PRE
  3. SE-POST
  4. SE-Identity

The standard SE block is applied right after the final convolutional layer of the block; in this case of a Residual Network, right before the merging of the skip connection. The SE-PRE configuration places the SE block at the start of the block, before the first convolutional layer, while SE-POST does the opposite and places it at the end of the block (after the merging of the skip connection). Finally, the SE-Identity variant applies the SE module in the skip-connection branch itself, parallel to the main block, and its output is added to the final output like a normal residual. The sketch below illustrates the four placements.
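The following minimal sketch summarizes where the SE operation sits in each variant; block stands for the residual branch and se for an SE module, both hypothetical callables, with an identity shortcut assumed for brevity:

def standard_se(x, block, se):
    return x + se(block(x))                # SE on the residual branch, before the addition

def se_pre(x, block, se):
    return x + block(se(x))                # SE before the first convolution of the block

def se_post(x, block, se):
    return se(x + block(x))                # SE after the skip connection is merged

def se_identity(x, block, se):
    return block(x) + se(x)                # SE applied on the skip branch, in parallel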

The authors provided the results of their extensive ablation studies on these integration strategies, shown in the following two tables:

Table 1. Effect of different SE integration strategies on the error rates of ResNet-50 on the ImageNet classification task
Strategy Top-1 Error Rate Top-5 Error Rate
SE 22.28 6.03
SE-PRE 22.23 6.00
SE-POST 22.78 6.35
SE-Identity 22.20 6.15
Table 2. Effect of introducing the SE block after the spatial 3×3 convolutional layer in a Residual block
Design Top-1 Error Rate Top-5 Error Rate GFLOPs Parameters
SE 22.28 6.03 3.87 28.1M
SE-3×3 22.48 6.02 3.86 25.8M

As we can see from Table 1, every configuration except SE-POST provided similar and consistent performance. As demonstrated in Table 2, the authors further experimented with inserting the SE block after the spatial 3×3 convolution in the residual block. Since the 3×3 spatial convolution operates on a smaller number of channels, the parameter and FLOPs overhead is much smaller. While it is able to provide similar performance to the default SE configuration, the authors didn’t provide any conclusive statement on which configuration is most favorable, and kept “SE” as the default integration configuration.

Fortunately, the authors do answer the question of how to integrate SE blocks into existing architectures.

SE-ResNet Module

In a Residual Network, the Squeeze-and-Excitation block is plugged in after the final convolutional layer of the block, prior to the addition of the residual in the skip connection. The intuition behind this is to keep the skip-connection branch as clean as possible to make learning the identity easy.
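Putting the pieces together, here is a sketch of how an SE block can sit inside a ResNet-style bottleneck (a simplified illustration assuming stride 1 and matching channel counts, not the exact SE-ResNet-50 definition):

import torch
from torch import nn

class SEBottleneckSketch(nn.Module):
    """Simplified residual bottleneck with an SE block before the skip addition."""
    def __init__(self, channels, reduction_ratio=16):
        super().__init__()
        mid = channels // 4
        self.branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.se = nn.Sequential(                       # squeeze-and-excite on the branch output
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction_ratio, bias=False), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction_ratio, channels, bias=False), nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.branch(x)
        w = self.se(out).view(x.size(0), -1, 1, 1)     # per-channel weights in (0, 1)
        return self.relu(out * w + x)                  # the skip connection stays untouched

# Example usage
print(SEBottleneckSketch(256)(torch.randn(1, 256, 14, 14)).shape)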

SE-Inception Module

In an Inception Network, however, because of the absence of skip connections, the SE block is simply inserted in each Inception block after the final convolutional layer.

In the following figure from the paper, the authors show the modified ResNet-50 and ResNeXt-50 architectures with an SE module in every block.

SE-Based Architecture Designs

The authors also studied the effect of integrating the SE block into each of the four stages of a ResNet-50. The results are shown in the following table.

Stage Top-1 Error Rate Top-5 Error Rate GFLOPs Parameters
ResNet-50 23.30 6.55 3.86 25.6M
SE Stage 2 23.03 6.48 3.86 25.6M
SE Stage 3 23.04 6.32 3.86 25.7M
SE Stage 4 22.68 6.22 3.86 26.4M
SE All 22.28 6.03 3.87 28.1M

Code

The official code repository associated with the paper can be found here. However, the code is written in Caffe, a framework that is less popular nowadays. Let’s take a look at PyTorch and TensorFlow versions of the module.

PyTorch

### Import necessary packages
from torch import nn

### Squeeze-and-Excitation class definition
class SE(nn.Module):
    def __init__(self, channel, reduction_ratio=16):
        super(SE, self).__init__()
        ### Global Average Pooling
        self.gap = nn.AdaptiveAvgPool2d(1)

        ### Fully Connected Multi-Layer Perceptron (FC-MLP)
        self.mlp = nn.Sequential(
            nn.Linear(channel, channel // reduction_ratio, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction_ratio, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.gap(x).view(b, c)
        y = self.mlp(y).view(b, c, 1, 1)
        return x * y.expand_as(x)

TensorFlow

import tensorflow as tf

__all__ = [
    'squeeze_and_excitation_block',
]

def squeeze_and_excitation_block(input_X, out_dim, reduction_ratio=16, layer_name='SE-block'):
    """Squeeze-and-Excitation (SE) Block

    An SE block performs feature recalibration - a mechanism that allows the
    network to use global information to selectively emphasize informative
    features and suppress less useful ones.
    """
    with tf.name_scope(layer_name):
        # Squeeze: Global Information Embedding
        squeeze = tf.nn.avg_pool(input_X,
                                 ksize=[1, *input_X.shape[1:3], 1],
                                 strides=[1, 1, 1, 1],
                                 padding='VALID',
                                 name='squeeze')

        # Excitation: Adaptive Feature Recalibration
        ## Dense (Bottleneck) -> ReLU
        with tf.variable_scope(layer_name + '-variables'):
            excitation = tf.layers.dense(squeeze, units=out_dim // reduction_ratio,
                                         name='excitation-bottleneck')
            excitation = tf.nn.relu(excitation, name='excitation-bottleneck-relu')

        ## Dense -> Sigmoid
        with tf.variable_scope(layer_name + '-variables'):
            excitation = tf.layers.dense(excitation, units=out_dim, name='excitation')
            excitation = tf.nn.sigmoid(excitation, name='excitation-sigmoid')

        # Scaling
        scaler = tf.reshape(excitation, shape=[-1, 1, 1, out_dim], name='scaler')
        return input_X * scaler

MBConv in EfficientNets

Some of the most influential work that incorporates the Squeeze-and-Excitation block builds on the mobile inverted residual block (MBConv) introduced by MobileNetV2; EfficientNets use MBConv blocks and add a Squeeze-and-Excitation block to them.

MBConv blocks in EfficientNets

In MBConv, the Squeeze-and-Excitation block is placed before the final convolutional layer, after the spatial convolution in the block. This makes it more of an integral part of the block rather than an add-on, which is what it was originally intended to be. The authors of SENet had conducted ablation studies and also tested this integration method; however, they opted for the default configuration of adding an SE block after the final 1×1 convolution. EfficientNets are considered to be state of the art (SOTA) in many tasks, ranging from image classification on the standard ImageNet-1k dataset to object detection on the MS-COCO dataset. This is a testament to the importance of channel attention and the efficiency of Squeeze-and-Excitation blocks.
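For illustration, here is a stripped-down sketch of an MBConv-style block with the SE block after the depthwise (spatial) convolution; the expansion ratio, activation choice, and the absence of stride and drop-connect handling are simplifying assumptions, not the exact EfficientNet implementation:

import torch
from torch import nn

class MBConvSketch(nn.Module):
    def __init__(self, channels, expansion=4, reduction_ratio=16):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(                   # 1x1 expansion convolution
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU())
        self.depthwise = nn.Sequential(                # 3x3 depthwise (spatial) convolution
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.se = nn.Sequential(                       # SE block on the expanded features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(hidden, hidden // reduction_ratio, bias=False), nn.SiLU(),
            nn.Linear(hidden // reduction_ratio, hidden, bias=False), nn.Sigmoid())
        self.project = nn.Sequential(                  # final 1x1 projection convolution
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.depthwise(self.expand(x))
        w = self.se(out).view(x.size(0), -1, 1, 1)     # per-channel weights in (0, 1)
        return x + self.project(out * w)               # inverted residual connection

# Example usage
print(MBConvSketch(64)(torch.randn(1, 64, 56, 56)).shape)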

Benchmarks

The authors provide extensive results on different tasks like image classification, scene classification, and object detection, on competitive standard datasets like ImageNet, MS-COCO, and Places-365. The following tables showcase the efficiency and advantages of using SE modules in the tasks mentioned above:

CIFAR-10 Classification Task

Architecture Vanilla SE-variant
ResNet-110 6.37 5.21
ResNet-164 5.46 4.39
WRN-16-8 4.27 3.88
Shake-Shake 26 2x96d + Cutout 2.56 2.12

The metric used for comparison here is the classification error. The authors also added a form of data augmentation, namely Cutout in the Shake-Shake network, to confirm whether the performance improvement obtained when using the SE module is consistent with the use of other performance-boosting techniques such as data augmentation.

CIFAR-100 Classification Task

Architecture Vanilla SE-variant
ResNet-110 26.88 23.85
ResNet-164 24.33 21.31
WRN-16-8 20.43 19.14
Shake-Shake 26 2x96d + Cutout 15.85 15.41

ImageNet-1k Classification Task

ImageNet classification performance comparison for standard deep architectures

ImageNet classification performance comparison for light mobile architectures

Training Dynamics

Training curves of different networks with and without Squeeze-and-Excitation (SE).

As shown in the graphs above, networks equipped with Squeeze-and-Excitation modules show consistently improved training curves, which leads to better generalization and higher performance.

Scene Classification Task on the Places-365 Dataset

Architecture Top-1 Error Rate Top-5 Error Rate
Places-365-CNN 41.07 11.48
ResNet-152 41.15 11.61
SE-ResNet-152 40.37 11.01

Object Detection Task on the MS-COCO Dataset using a Faster R-CNN

Backbone AP@IoU=0.5 AP
ResNet-50 57.9 38.0
SE-ResNet-50 61.0 40.4
ResNet-101 60.1 39.9
SE-ResNet-101 62.7 41.9

Shortcomings

Although the paper is revolutionary in its own right, there are certain flaws in its structure and some inconclusive design strategies.

  1. The method is rather costly and adds a significant number of parameters and FLOPs on top of the baseline model. Although in the grand scheme of things this overhead might be minimal, there have been many newer approaches aimed at providing channel attention at an extremely cheap cost which have performed better than SENets, for instance ECANet (published at CVPR 2020).
  2. Although channel attention seems to be efficient in terms of parameter and FLOPs overhead, one major flaw is the scaling operation, where the weighted channel vector is broadcast and multiplied element-wise with the input tensor. This intermediate broadcast tensor occupies the same dimensional space as the input, increasing memory complexity by a large margin. This makes the training process slower and more memory-intensive.
  3. To reduce computational complexity, there is a bottleneck structure in the MLP of the excitation module, where the number of channels is reduced by a specified reduction ratio. This causes information loss and is thus sub-optimal.
  4. Since SENet only provides channel attention by using dedicated global feature descriptors, in this case Global Average Pooling (GAP), there is a loss of information and the attention provided is point-wise. This means that all pixels in the spatial domain of a feature map are weighted uniformly, with no discrimination between important or class-deterministic pixels and those which are part of the background or do not contain useful information. Thus, the importance of and need for spatial attention coupled with channel attention is justified. One of the prime examples of this is CBAM (published at ECCV 2018).
  5. There are inconclusive design strategies regarding SENet. The authors stated that determining the optimal settings, which include the positional integration strategy for SE modules and the reduction ratio to be used in the MLP, was beyond the scope of the paper.

References

  1. Squeeze-and-Excitation Networks, TPAMI 2018.
  2. CBAM: Convolutional Block Attention Module, ECCV 2018.
  3. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks, CVPR 2020.
  4. SENet original repository.