Convolutional Autoencoders


Convolutional neural networks take in a 2-dimensional spatial data instance (an image) and process it until a 1-dimensional vector representation of some kind is produced. This begs the question: if a mapping can be learnt from an image matrix to a vector representation, perhaps a mapping can also be learnt from that vector representation back to an image. In this demo article, we will be exploring just that.

Prerequisites

A basic understanding of Python code and neural networks is needed to follow along with this tutorial. We recommend this article to intermediate to advanced coders with experience developing novel architectures.

The code in this article can be executed on a normal home PC or a DigitalOcean Droplet.

In previous articles, I touched on the fact that convolution layers in a convnet serve the purpose of extracting features from images. Those features are then passed on to linear layers which perform the actual classification (exceptions are made for architectures that utilize 1 x 1 convolution layers for downsampling).

image

Consider VGG-16 with its architecture depicted above. From the input layer right up to the point where the pooled 7 x 7 x 512 feature maps are flattened to create a vector of 25,088 elements, that portion of the network serves as a feature extractor. Essentially, a 224 x 224 image with a total of 50,176 pixels is processed to create a 25,088-element feature vector, and this feature vector is then passed to the linear layers for classification.
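To make this concrete, here is a minimal sketch (not part of the original tutorial, and assuming a recent torchvision version where the weights argument is available) that passes a dummy image through the convolutional portion of VGG-16 and confirms the 25,088-element feature vector:

import torch
import torchvision

vgg = torchvision.models.vgg16(weights=None)  #  untrained weights are enough for a shape check
dummy_image = torch.randn(1, 3, 224, 224)     #  stand-in for a 224 x 224 RGB image

feature_maps = vgg.features(dummy_image)      #  -> (1, 512, 7, 7)
pooled = vgg.avgpool(feature_maps)            #  -> (1, 512, 7, 7) after adaptive pooling
feature_vector = torch.flatten(pooled, 1)     #  -> (1, 25088)
print(feature_vector.shape)                   #  torch.Size([1, 25088])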

Since these features are extracted by a convnet, it is logical to assume that another convnet could potentially make sense of those features and put the original image they belong to back together, essentially reversing the feature extraction process. This is, in essence, what an autoencoder does.

Structure of an Autoencoder

As stated in the previous section, autoencoders are deep learning architectures capable of reconstructing data instances from their feature vectors. They work on all sorts of data, but this article is chiefly concerned with their application to image data. An autoencoder is made up of three main components: an encoder, a bottleneck and a decoder.

image

Encoder

The first section of an autoencoder, the encoder is the convnet that acts specifically as a feature extractor. Its primary function is to extract the most salient features from images and return them as a vector.

Bottleneck

Located right after the encoder, the bottleneck, also called a code layer, serves as an extra layer which helps to compress the extracted features into a smaller vector representation. This is done in a bid to make it more difficult for the decoder to make sense of the features and force it to learn more complex mappings.

Decoder

The last section of an autoencoder, the decoder is the convnet which attempts to make sense of the features coming from the encoder, which have subsequently been compressed in the bottleneck, so as to reconstruct the original image as it was.

Training an Autoencoder

In this section we shall be implementing an autoencoder from scratch in PyTorch and training it on a specific dataset.

Let's start by quickly importing our required packages.

#  article dependencies
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import torchvision.datasets as Datasets
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt
import cv2
from tqdm.notebook import tqdm
from tqdm import tqdm as tqdm_regular
import seaborn as sns
from torchvision.utils import make_grid
import random

#  configuring device
if torch.cuda.is_available():
  device = torch.device('cuda:0')
  print('Running on the GPU')
else:
  device = torch.device('cpu')
  print('Running on the CPU')

Preparing Data

For the purpose of this article, we will utilize the CIFAR-10 dataset in training a convolutional autoencoder. It can be loaded as seen in the code cell below.

#  loading training data
training_set = Datasets.CIFAR10(root='./', download=True,
                                transform=transforms.ToTensor())

#  loading validation data
validation_set = Datasets.CIFAR10(root='./', download=True, train=False,
                                  transform=transforms.ToTensor())

Next we need to extract only the images from the dataset. Since we are trying to teach an autoencoder to reconstruct images, the targets will not be class labels but the actual images themselves. An image from each class is also extracted and stored in the object 'test_images' just for visualization purposes; more on this later.

def extract_each_class(dataset):
  """
  This function searches for and returns
  one image per class
  """
  images = []
  ITERATE = True
  i = 0
  j = 0

  while ITERATE:
    for label in tqdm_regular(dataset.targets):
      if label==j:
        images.append(dataset.data[i])
        print(f'class {j} found')
        i+=1
        j+=1
        if j==10:
          ITERATE = False
      else:
        i+=1

  return images


#  extracting training images
training_images = [x for x in training_set.data]

#  extracting validation images
validation_images = [x for x in validation_set.data]

#  extracting test images for visualization purposes
test_images = extract_each_class(validation_set)
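As a quick sanity check (a hypothetical addition, assuming the cells above have run), the extracted lists should hold the full 50,000 training and 10,000 validation images, plus one image per class:

print(len(training_images))      #  50000
print(len(validation_images))    #  10000
print(len(test_images))          #  10, one image per CIFAR-10 class
print(training_images[0].shape)  #  (32, 32, 3) uint8 numpy array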

Now we need to define a PyTorch dataset class so as to be able to use the images as tensors. This, along with class instantiation, is done in the code cell below.

#  defining dataset class
class CustomCIFAR10(Dataset):
  def __init__(self, data, transforms=None):
    self.data = data
    self.transforms = transforms

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    image = self.data[idx]

    if self.transforms!=None:
      image = self.transforms(image)
    return image


#  creating pytorch datasets
training_data = CustomCIFAR10(training_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                              transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
validation_data = CustomCIFAR10(validation_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                                  transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
test_data = CustomCIFAR10(test_images, transforms=transforms.Compose([transforms.ToTensor(),
                                                                      transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]))
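To confirm the dataset class returns what we expect, a brief check (hypothetical, not part of the original pipeline) on a single sample should show a 3 x 32 x 32 tensor with values scaled to roughly [-1, 1] by the normalization transform:

sample = training_data[0]
print(sample.shape)                #  torch.Size([3, 32, 32])
print(sample.min(), sample.max())  #  approximately -1.0 and 1.0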

Autoencoder Architecture

A custom convolutional autoencoder architecture is defined for the purpose of this article, as illustrated below. This architecture is designed to work with the CIFAR-10 dataset, as its encoder takes in 32 x 32 pixel images with 3 channels and processes them until 64 8 x 8 feature maps are produced.

These feature maps are then flattened to produce a vector of 4096 elements, which is then compressed to just 200 elements in the bottleneck. The decoder takes this 200-element vector representation and processes it via transposed convolution until a 3 x 32 x 32 image is returned as output.
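For intuition on how the decoder upsamples, here is a small sketch (an illustrative addition) showing that a stride-2 transposed convolution with padding=1 and output_padding=1 doubles the spatial size, following output = (input - 1) * stride - 2 * padding + kernel_size + output_padding:

import torch
import torch.nn as nn

up = nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, padding=1, output_padding=1)
x = torch.randn(1, 64, 8, 8)   #  64 feature maps of size 8 x 8, as in our bottleneck reshaping
print(up(x).shape)             #  torch.Size([1, 32, 16, 16]), i.e. 8 x 8 doubled to 16 x 16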

image

The architecture defined above is implemented in the code cell below. The parameter 'latent_dim' in this case refers to the size of the bottleneck, which we have specified to be 200.

#  defining encoder
class Encoder(nn.Module):
  def __init__(self, in_channels=3, out_channels=16, latent_dim=200, act_fn=nn.ReLU()):
    super().__init__()

    self.net = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1), # (32, 32)
        act_fn,
        nn.Conv2d(out_channels, out_channels, 3, padding=1),
        act_fn,
        nn.Conv2d(out_channels, 2*out_channels, 3, padding=1, stride=2), # (16, 16)
        act_fn,
        nn.Conv2d(2*out_channels, 2*out_channels, 3, padding=1),
        act_fn,
        nn.Conv2d(2*out_channels, 4*out_channels, 3, padding=1, stride=2), # (8, 8)
        act_fn,
        nn.Conv2d(4*out_channels, 4*out_channels, 3, padding=1),
        act_fn,
        nn.Flatten(),
        nn.Linear(4*out_channels*8*8, latent_dim),
        act_fn
    )

  def forward(self, x):
    x = x.view(-1, 3, 32, 32)
    output = self.net(x)
    return output


#  defining decoder
class Decoder(nn.Module):
  def __init__(self, in_channels=3, out_channels=16, latent_dim=200, act_fn=nn.ReLU()):
    super().__init__()

    self.out_channels = out_channels

    self.linear = nn.Sequential(
        nn.Linear(latent_dim, 4*out_channels*8*8),
        act_fn
    )

    self.conv = nn.Sequential(
        nn.ConvTranspose2d(4*out_channels, 4*out_channels, 3, padding=1), # (8, 8)
        act_fn,
        nn.ConvTranspose2d(4*out_channels, 2*out_channels, 3, padding=1,
                           stride=2, output_padding=1), # (16, 16)
        act_fn,
        nn.ConvTranspose2d(2*out_channels, 2*out_channels, 3, padding=1),
        act_fn,
        nn.ConvTranspose2d(2*out_channels, out_channels, 3, padding=1,
                           stride=2, output_padding=1), # (32, 32)
        act_fn,
        nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
        act_fn,
        nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
    )

  def forward(self, x):
    output = self.linear(x)
    output = output.view(-1, 4*self.out_channels, 8, 8)
    output = self.conv(output)
    return output


#  defining autoencoder
class Autoencoder(nn.Module):
  def __init__(self, encoder, decoder):
    super().__init__()
    self.encoder = encoder
    self.encoder.to(device)

    self.decoder = decoder
    self.decoder.to(device)

  def forward(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded
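As a quick shape check (hypothetical, using random inputs and untrained weights on the CPU), the encoder should map a batch of 32 x 32 images to 200-element vectors, and the decoder should map those back to image shape:

encoder, decoder = Encoder(), Decoder()
dummy_batch = torch.randn(4, 3, 32, 32)

latent = encoder(dummy_batch)              #  torch.Size([4, 200])
reconstruction = decoder(latent)           #  torch.Size([4, 3, 32, 32])
print(latent.shape, reconstruction.shape)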

Per usual, we now need to define a class which will help make training and validation more seamless. In this case, since we are training a generative model, losses might not carry too much information. In general, we want the loss to decrease, and we can also use loss values to see how well the autoencoder reconstructs images after each epoch. For this reason, I have included a visualization block as seen below.

class ConvolutionalAutoencoder():
  def __init__(self, autoencoder):
    self.network = autoencoder
    self.optimizer = torch.optim.Adam(self.network.parameters(), lr=1e-3)

  def train(self, loss_function, epochs, batch_size,
            training_set, validation_set, test_set):

    #  creating log
    log_dict = {
        'training_loss_per_batch': [],
        'validation_loss_per_batch': [],
        'visualizations': []
    }

    #  defining weight initialization function
    def init_weights(module):
      if isinstance(module, nn.Conv2d):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)
      elif isinstance(module, nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)

    #  initializing network weights
    self.network.apply(init_weights)

    #  creating dataloaders
    train_loader = DataLoader(training_set, batch_size)
    val_loader = DataLoader(validation_set, batch_size)
    test_loader = DataLoader(test_set, 10)

    #  setting convnet to training mode
    self.network.train()
    self.network.to(device)

    for epoch in range(epochs):
      print(f'Epoch {epoch+1}/{epochs}')
      train_losses = []

      #------------
      #  TRAINING
      #------------
      print('training...')
      for images in tqdm(train_loader):
        #  zeroing gradients
        self.optimizer.zero_grad()
        #  sending images to device
        images = images.to(device)
        #  reconstructing images
        output = self.network(images)
        #  computing loss
        loss = loss_function(output, images.view(-1, 3, 32, 32))
        #  calculating gradients
        loss.backward()
        #  optimizing weights
        self.optimizer.step()

        #--------------
        # LOGGING
        #--------------
        log_dict['training_loss_per_batch'].append(loss.item())

      #--------------
      # VALIDATION
      #--------------
      print('validating...')
      for val_images in tqdm(val_loader):
        with torch.no_grad():
          #  sending validation images to device
          val_images = val_images.to(device)
          #  reconstructing images
          output = self.network(val_images)
          #  computing validation loss
          val_loss = loss_function(output, val_images.view(-1, 3, 32, 32))

        #--------------
        # LOGGING
        #--------------
        log_dict['validation_loss_per_batch'].append(val_loss.item())

      #--------------
      # VISUALISATION
      #--------------
      print(f'training_loss: {round(loss.item(), 4)} validation_loss: {round(val_loss.item(), 4)}')

      for test_images in test_loader:
        #  sending test images to device
        test_images = test_images.to(device)
        with torch.no_grad():
          #  reconstructing test images
          reconstructed_imgs = self.network(test_images)
        #  sending reconstructed and test images to cpu to allow for visualization
        reconstructed_imgs = reconstructed_imgs.cpu()
        test_images = test_images.cpu()

        #  visualisation
        imgs = torch.stack([test_images.view(-1, 3, 32, 32), reconstructed_imgs],
                           dim=1).flatten(0,1)
        grid = make_grid(imgs, nrow=10, normalize=True, padding=1)
        grid = grid.permute(1, 2, 0)
        plt.figure(dpi=170)
        plt.title('Original/Reconstructed')
        plt.imshow(grid)
        log_dict['visualizations'].append(grid)
        plt.axis('off')
        plt.show()

    return log_dict

  def autoencode(self, x):
    return self.network(x)

  def encode(self, x):
    encoder = self.network.encoder
    return encoder(x)

  def decode(self, x):
    decoder = self.network.decoder
    return decoder(x)

With everything set, we can then instantiate our autoencoder as a member of the convolutional autoencoder class we defined above, using the parameters as specified in the code cell that follows.

#  training model
model = ConvolutionalAutoencoder(Autoencoder(Encoder(), Decoder()))

log_dict = model.train(nn.MSELoss(), epochs=10, batch_size=64,
                       training_set=training_data, validation_set=validation_data,
                       test_set=test_data)
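Once training completes, the helper methods on the wrapper class can be used directly. A brief example (assuming the objects defined above) that encodes a single test image into its 200-element representation and decodes it back:

test_batch = next(iter(DataLoader(test_data, batch_size=1)))

with torch.no_grad():
  code = model.encode(test_batch.to(device))  #  torch.Size([1, 200])
  reconstruction = model.decode(code)         #  torch.Size([1, 3, 32, 32])

print(code.shape, reconstruction.shape)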

Right from the end of the first epoch, it is evident that our decoder has begun to develop a sense of how to reconstruct images fed into the encoder, even though it only had access to a compressed 200-element feature vector representation. Reconstructed images continue to gain detail right up until the 10th epoch as well.

Epoch 1 (top) vs Epoch 10 (bottom).

Looking at the training and validation losses, the autoencoder could still benefit slightly from some more epochs of training, as its losses are still trending downward. This is the case for the validation loss more so than the training loss, which seems to be plateauing.

image
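The loss curves above can be reproduced from the returned log. Here is a minimal sketch (note that both losses are logged per batch, so the two x-axes differ in length):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3), dpi=110)
ax1.plot(log_dict['training_loss_per_batch'])
ax1.set_title('training loss per batch')
ax2.plot(log_dict['validation_loss_per_batch'])
ax2.set_title('validation loss per batch')
plt.show()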

Bottleneck and Details

In one of the previous sections, I mentioned how the bottleneck code layer serves the purpose of further compressing a feature vector, so as to force the decoder to learn a more complex and generalizable mapping. On the flip side, a fine balance must be sought, as the degree of compression in the code layer also influences how well a decoder can reconstruct an image.

The smaller the vector representation passed to the decoder, the fewer image features the decoder has access to and the less detailed its reconstructions will be. In the same sense, the bigger the vector representation passed to the decoder, the more image features it has access to and the more detailed its reconstructions will be. Following this line of thinking, let's train the same autoencoder architecture, but this time using a bottleneck of size 1000.

#  training model
model = ConvolutionalAutoencoder(Autoencoder(Encoder(latent_dim=1000), Decoder(latent_dim=1000)))

log_dict = model.train(nn.MSELoss(), epochs=10, batch_size=64,
                       training_set=training_data, validation_set=validation_data,
                       test_set=test_data)

From the visualizations generated per epoch, it is immediately evident that the decoder does a better job at reconstructing images in terms of detail and visual accuracy. This comes down to the fact that the new decoder has access to more features, as the original feature vector of 4096 elements is now downsampled to 1000 elements instead of 200.

image

Epoch 1 (top) vs Epoch 10 (bottom).

Again, the autoencoder could benefit from some more epochs of training. Its training and validation losses are still trending downward, with slopes steeper than those we observed when we trained our autoencoder with a bottleneck of just 200 elements.

image

Comparing bottlenecks of size 200 and 1000, both at the 10th epoch, shows clearly that images generated with a bottleneck of 1000 elements are clearer and more detailed than those generated with a bottleneck of 200 elements.

image

Bottleneck 200 (top) vs bottleneck 1000 (bottom), both at the 10th epoch.

Training to Absolute Refinement

At what point is a convolutional autoencoder optimally trained? From the two autoencoders we have trained, we can observe that reconstructions are still blurry at the 10th epoch even though our loss plots had begun to flatten. Increasing the bottleneck size will only ameliorate this issue to an extent, but will not completely solve it.

This is partly down to the loss function used in this case, mean squared error, as it does not do too well at measuring losses in generative models. For the most part, these blurry reconstructions are the bane of convolutional autoencoder tasks. If one's goal is to reconstruct or generate images, a generative adversarial network (GAN) or a diffusion model may be a safer bet. However, that is not to say that convolutional autoencoders are not useful, as they can be used for anomaly detection, image denoising and so on.

In this article, we discussed autoencoders in the context of image data. We went on to take a look at what exactly a convolutional autoencoder does, and how it does it, with a view to developing an intuition of its working principle. Thereafter, we touched on its different sections before going further to define a custom autoencoder of our own, training it and discussing the results of the model training.
