Visual Question Answering: a Survey

Nov 12, 2024 02:21 PM

We’ve seen a lot of advancements in the past few years in many subdomains of machine learning. Computer vision tasks like object detection and image segmentation, as well as NLP tasks like entity recognition, language generation, and question answering, are now being solved by neural networks and approached much differently, with more speed and higher accuracy.

A task that has caught the attention of the AI community recently is visual question answering. This article will explore the problem of visual question answering, different approaches to solve it, associated challenges, datasets, and evaluation methods.

Introduction

Visual question answering systems attempt to correctly answer questions in natural language regarding an image input. The broader idea of this problem is to design systems that can understand the contents of an image similar to how humans do, and communicate effectively about that image in natural language. This is a challenging task since it requires image-based models and natural language models to interact with and complement each other.

The problem has been widely accepted as AI-complete, i.e. one that confronts the Artificial General Intelligence problem, namely making computers as intelligent as people. In fact, Geman et al. (2015) have suggested using the problem as a Visual Turing Test.

To give you an idea of the subproblems the task of visual question answering entails:

image


Solutions to these problems involve four major steps:

  • Image featurization - converting images into their feature representations for further processing.
  • Question featurization - converting natural language questions into their embeddings for further processing.
  • Joint feature representation - ways of combining image features and question features to enhance algorithmic understanding.
  • Answer generation - using the joint features to understand the input image and the question asked, to finally generate the correct answer.

Each phase in this pipeline has been approached in several ways. We will look through the primary ones in this article.

Prerequisites

  • Basic Python Programming: Familiarity with Python syntax and libraries like NumPy and Pandas.
  • Linear Algebra: Understanding of matrices, vectors, and operations like dot products and convolutions.
  • Calculus: Basics of derivatives and gradients for optimization.
  • Probability & Statistics: Basic concepts like mean, variance, and distributions.
  • Deep Learning Fundamentals: Knowledge of neural networks, activation functions, and backpropagation.
  • PyTorch/TensorFlow Basics (Optional): Familiarity with deep learning frameworks for implementation.

Image Featurization

Convolutional neural networks have become the gold standard for pattern recognition in images. After an input image is passed through a convolutional network, it gets transformed into an abstract feature representation. Each filter in a CNN layer captures different kinds of patterns, such as edges, vertices, contours, curves, and symmetries. This concept is explained very well in this article, which discusses equivariance in neural networks and how activation maps look for different kinds of detectors.

Check out the 3-D visualization tool for a CNN-based network trained on MNIST data for hand-written digit recognition here.

image


CNNs have evolved into deeper and more complex architectures which are widely used for downstream tasks like classification, object detection, and image segmentation. Some such networks include AlexNet, ResNet, LeNet, SqueezeNet, VGGNet, ZFNet, etc.

Most VQA literature uses CNNs for image featurization. The network’s last layer is removed and the rest of the network is used for image feature generation. Sometimes, the second to last layer is normalized (Kafle et al. (2016) and Saito et al. (2017)), or passed through a dimensionality reduction layer (Kafle et al. (2016), Ilievski et al. (2016)).

image


From the survey above, it’s clear that VGGNet was the network of choice before ResNets came along. Most of the VQA papers that came out after 2017 use ResNets.

The core idea of ResNets involves skip connections, as shown below.

image

The identity shortcut connection allows the network to skip middle layers. The idea is to avoid the exploding and vanishing gradients that very deep neural networks often face by allowing the network to skip layers if needed. It also helps improve accuracy, since layers hurting the accuracy can be skipped and regularized.
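To make the shortcut concrete, here is a minimal NumPy sketch of a residual block. It assumes a residual branch made of two plain linear layers with a ReLU in between (real ResNets use convolutions and batch normalization), and the names `residual_block`, `w1`, and `w2` are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(fx + x)      # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# With zero weights the residual branch contributes nothing,
# so the block reduces to relu(x): the middle layers are effectively skipped.
w_zero = np.zeros((8, 8))
y = residual_block(x, w_zero, w_zero)
```

Because the input is added back unchanged, the block can fall back to (nearly) the identity when its own layers are not useful, which is the behaviour described above.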

Ilievski et al. (2016) use question word embeddings (which we will discuss soon) to extract objects whose labels are similar to the question itself, and extract the feature representations of those objects using a ResNet. They call this approach “focused dynamic attention.”

Lu et al. (2019) use ViLBERT (short for Vision-and-Language BERT) for visual question answering. ViLBERT consists of two parallel BERT-style models operating over image regions and text segments. Each stream is a series of transformer and co-attentional transformer layers which enable information exchange between modalities.

Question Featurization

There are several methods to create embeddings. Older approaches include count-based and frequency-based methods like count vectorization and TF-IDF. There are prediction-based methods like continuous bag of words and skip-grams as well. Pretrained models for the Word2Vec algorithm are also available in open source tools like Gensim. You can learn about these methods here. Deep learning architectures like RNNs, LSTMs, GRUs, and 1-D CNNs can also be used to create word embeddings. In VQA literature, LSTMs are used most frequently.
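As a small illustration of the count-based methods, the sketch below builds count vectors and TF-IDF vectors for three made-up questions in pure Python. It assumes whitespace tokenization and the plain, unsmoothed definition `idf = log(N / df)`; libraries typically use smoothed variants, so exact values will differ.

```python
import math
from collections import Counter

# Toy questions, invented for this example.
questions = [
    "what color is the cat",
    "what is the man holding",
    "is the cat sleeping",
]

tokenized = [q.split() for q in questions]
vocab = sorted({w for toks in tokenized for w in toks})

def count_vector(tokens):
    """Count vectorization: one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Inverse document frequency: words that occur in every question get weight 0.
n_docs = len(tokenized)
doc_freq = {w: sum(w in toks for toks in tokenized) for w in vocab}
idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}

def tfidf_vector(tokens):
    """Term frequency (relative count) times inverse document frequency."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocab]

count_vecs = [count_vector(t) for t in tokenized]
tfidf_vecs = [tfidf_vector(t) for t in tokenized]
```

Words such as "the" and "is" appear in every question, so their TF-IDF weight is zero, while rarer words such as "color" keep a positive weight.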

Most of you reading this article are probably already aware of what RNNs are, but we will still touch upon the basic concepts for the sake of completeness. Recurrent neural networks take in sequential inputs and predict the next element in the sequence depending on the data they’ve been trained on. Vanilla recurrent networks would, based on the input provided, process their previous hidden state and output the next hidden state and a sequential prediction. This prediction is compared with the ground truth values to update weights using backpropagation.

We also know that RNNs are susceptible to vanishing and exploding gradients. To fix this problem, LSTMs came into being. LSTMs use different gates to manage the amount of importance given to each previous element in the sequence. There are also bidirectional variants of LSTMs which learn the sequential dependence of different elements from left to right, as well as right to left.
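The gating idea can be sketched in a few lines of NumPy. This is a minimal, untrained LSTM cell using the standard input, forget, and output gates; the weight layout (all four gates packed into one matrix `W`) and the helper `encode_question` are assumptions of this example, not taken from any particular VQA paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (input_dim + hidden_dim, 4 * hidden_dim)."""
    hidden = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell state
    c_t = f * c_prev + i * g                # keep part of the old memory, add new
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def encode_question(embeddings, W, b, hidden_dim):
    """Run the cell over a sequence of word embeddings; return the last hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in embeddings:
        h, c = lstm_step(x_t, h, c, W, b)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 5, 4, 6
W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
word_embeddings = rng.normal(size=(seq_len, input_dim))   # stand-in for embedded question words
question_vector = encode_question(word_embeddings, W, b, hidden_dim)
```

The final hidden state plays the role of the fixed-size question feature that later stages combine with the image features.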

image


Lu et al. (2016) build a hierarchical architecture that co-attends to the image and question at three levels: (a) the word level, (b) the phrase level, and (c) the question level. At the word level, they embed words into a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks are used to capture the information contained in unigrams, bigrams, and trigrams. At the question level, they use recurrent neural networks to encode the entire question.

This survey lists unique ways of doing question featurization, which are also mentioned below.

Antol et al. (2015) use a BoW approach, but use only the top 1000 words from the questions in their dataset. They exploit the strong correlation between the words that start a question and the answer by creating another BoW from the top 10 first, second, and third words of the question, then concatenating it with the first representation.

Zhang et al. (2016) attempt to solve the binary visual question answering problem. They try to structure the information in a question by introducing a PRS structure, where P represents the primary object, R stands for relation, and S stands for the secondary object. P and S values would ideally be nouns, and R would be a verb or preposition.

To solve multiple-choice question answering, Shih et al. (2016) turned variable length questions into fixed-sized vectors by binning the words into the following categories:

  • Type of question, using the first two words
  • Nominal subject
  • All noun words
  • All remaining words

Yu et al. (2018) leverage Tree-LSTMs to capture the semantics in the question and its relation with the image. Tree-LSTMs are tree-structured objects, where each node is an LSTM cell. The forward pass can be done in many ways:

  • Child Sum Tree Unit: here the hidden states of the children nodes are added to get the output
  • Child Max Pooling Unit: here the output is the maximum of all the child nodes
  • Child Convolve + Max Pooling Unit: here the output is the maximum of the convolutions between different child nodes. A discrete convolution between two functions $f$ and $g$ can be represented as follows:

$$ (f * g)[n] = \sum_{m=-M}^{M} f[n - m]  g[m] $$
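The three units can be sketched with NumPy on toy hidden states. This is one plausible reading of the units described above, not the authors' code: the convolution uses 0-based indexing with zero padding instead of the symmetric $-M \dots M$ range, and the "convolve + max pooling" unit is taken to be an element-wise maximum over all pairwise convolutions.

```python
import numpy as np
from itertools import combinations

# Toy hidden states of three child nodes (values invented for the example).
children = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
])

child_sum = children.sum(axis=0)   # Child Sum Tree Unit
child_max = children.max(axis=0)   # Child Max Pooling Unit

def discrete_convolution(f, g):
    """(f * g)[n] = sum_m f[n - m] g[m], treating values outside the arrays as zero."""
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(g)):
            if 0 <= n - m < len(f):
                out[n] += f[n - m] * g[m]
    return out

conv = discrete_convolution(children[0], children[1])

# Child Convolve + Max Pooling (assumed form): max over all pairwise convolutions.
pairwise = [discrete_convolution(children[a], children[b])
            for a, b in combinations(range(len(children)), 2)]
child_conv_max = np.max(pairwise, axis=0)
```

The hand-written loop matches `np.convolve` in its default full mode, which is a convenient way to check the formula.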

Each question in the dataset is parsed and mapped to a tree structure where the root node is set to the question sequence. The tree provides semantic structure that aids logical reasoning.

Toor et al. (2019) devised a method to understand the relevance of a question and also make edits to an irrelevant question, which has proved effective. They call the method Question Action Relevance and Editing.

Joint Feature Representation

One of the most common ways of handling the different feature vectors coming from images and text is to just concatenate the two and let later layers find the right weights for each. You can also try element-wise addition and multiplication, if the feature vectors are of the same length. Malinowski et al. (2017) tried all the methods mentioned above and found that element-wise multiplication gives the best accuracy. Shih et al. (2016) use a dot product, whereas Saito et al. (2017) use a hybrid approach: they concatenate the result of element-wise multiplication with the result of element-wise addition.
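These simple fusion operators are easy to compare side by side. The sketch below uses two made-up 4-dimensional feature vectors; in practice the image and question features would come from a CNN and an LSTM and would be much longer.

```python
import numpy as np

# Toy feature vectors of equal length (values invented for the example).
image_feat = np.array([0.5, 1.0, -2.0, 0.0])
question_feat = np.array([2.0, -1.0, 0.5, 3.0])

concat = np.concatenate([image_feat, question_feat])   # concatenation, length 8
added = image_feat + question_feat                      # element-wise addition
multiplied = image_feat * question_feat                 # element-wise multiplication
dot = float(image_feat @ question_feat)                 # dot product (a single score)

# Hybrid fusion: concatenate the element-wise product and the element-wise sum.
hybrid = np.concatenate([multiplied, added])
```

Concatenation and the hybrid variant double the feature length, whereas addition and multiplication keep it fixed, which affects the size of the layers that follow.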

Several papers have used Canonical Correlation Analysis (CCA) to find joint feature representations as well. CCA is a method of finding correlations between two sets of variables. You evaluate different linear combinations of the variables in each set, similar to what is done in PCA.

Let $X$ be a vector of length $p$. Then,

$$ U_{1} = a_{11}X_{1} + a_{12}X_{2} + \dots + a_{1p}X_{p} $$

$$ U_{2} = a_{21}X_{1} + a_{22}X_{2} + \dots + a_{2p}X_{p} $$

$$ \vdots $$

$$ U_{p} = a_{p1}X_{1} + a_{p2}X_{2} + \dots + a_{pp}X_{p} $$

Let $Y$ be a vector of length $q$. Then,

$$ V_{1} = b_{11}Y_{1} + b_{12}Y_{2} + \dots + b_{1q}Y_{q} $$

$$ V_{2} = b_{21}Y_{1} + b_{22}Y_{2} + \dots + b_{2q}Y_{q} $$

$$ \vdots $$

$$ V_{q} = b_{q1}Y_{1} + b_{q2}Y_{2} + \dots + b_{qq}Y_{q} $$

Then $ (U_{i}, V_{i}) $ is the $ i^{th} $ canonical variate pair. Of the two vectors, you pick $X$ such that $p \leq q $ for computational convenience. Then there are $p$ canonical variate pairs.

Covariance can be defined as follows:

$$ cov(x, y) = \frac{\sum (x_{i} - \bar{x})(y_{i} - \bar{y})}{N - 1}$$

We can compute the variances of $ U_{i} $ and $ V_{i} $:

$$ var(U_{i}) = \sum_{k=1}^{p} \sum_{l=1}^{p} a_{ik}a_{il} cov(X_{k}, X_{l}) $$

$$ var(V_{i}) = \sum_{k=1}^{q} \sum_{l=1}^{q} b_{ik}b_{il} cov(Y_{k}, Y_{l}) $$

Then the canonical correlation between $ U_{i} $ and $ V_{i} $ can be calculated using the following formula:

$$ \rho_{i}^{*} = \frac{cov(U_{i}, V_{i})}{\sqrt{var(U_{i}) var(V_{i})}} $$

To find the joint feature representation we need to maximize the correlation between $U$ and $V$, that is, the value of $ \rho $. In scikit-learn, CCA is implemented as part of the Partial Least Squares family of algorithms. Kernel CCA is another variant, which uses a Lagrangian-based solution. Gong et al. (2014), Yu et al. (2015), and Tomassi et al. (2019) all use some variant of CCA in the joint feature representation.
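For intuition, the canonical correlations can be computed directly with NumPy. The sketch below is a textbook-style solution (whiten each set with its covariance, then take the singular values of the cross-covariance), not the iterative PLS-based routine; the function name, the small ridge term `eps`, and the synthetic data are assumptions of this example.

```python
import numpy as np

def canonical_correlations(X, Y, eps=1e-8):
    """Canonical correlations between the columns of X (n, p) and Y (n, q)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Sample covariance matrices, with a tiny ridge for numerical stability.
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    # Singular values of the whitened cross-covariance are the correlations rho_i.
    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(K, compute_uv=False)

# Synthetic data: X (p = 2) and Y (q = 3) share one latent signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
X = np.hstack([shared + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])
Y = np.hstack([rng.normal(size=(500, 1)),
               shared + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 1))])

rho = canonical_correlations(X, Y)
```

With $p = 2 \leq q = 3$ there are two canonical correlations; the first is close to 1 because of the shared signal and the second is close to 0.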

Noh et al. (2015) designed Dynamic Parameter Prediction Networks (DPPNets) for this task. They add a fully-connected layer after a question is vectorized using a GRU to dynamically assign weights for each question, before fusing with image features. They found this was challenging when the feature vectors had a large number of parameters. To circumvent this, they used a hashing mechanism to create a joint feature representation. They use the hashing trick proposed by Chen et al. (2015) for compression of neural networks. A hash function is used to group different parameters, and all parameters in a group share the same value. The hashing trick drastically reduces model sizes by exploiting redundancies in a neural network without hurting the model performance in any significant manner.
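The weight-sharing idea behind the hashing trick can be illustrated as follows. This is a toy sketch: the hash function `(31 * i + 17 * j) mod K` and the function name `hashed_layer` are made up for the example and are not the hash used by Chen et al. or Noh et al.

```python
import numpy as np

def hashed_layer(shape, shared_values):
    """Build a 'virtual' weight matrix whose entries all come from a small shared vector."""
    rows, cols = shape
    i, j = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    bucket = (i * 31 + j * 17) % len(shared_values)   # toy hash of the position (i, j)
    return shared_values[bucket]

rng = np.random.default_rng(0)
shared = rng.normal(size=50)          # only 50 real parameters are stored
W = hashed_layer((64, 32), shared)    # behaves like a 64 x 32 weight matrix
```

Here a matrix with 2,048 entries is backed by only 50 stored values; during training, the gradient of each stored value would be the sum of the gradients of all positions hashed to it.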

Fukui et al. (2016) use multimodal bilinear pooling for joint feature creation. Bilinear models take the outer product of two vectors to make a higher dimensional matrix, in which every element of one vector interacts with every element of the other in a multiplicative manner. For large vector sizes, this can create huge models with too many trainable parameters.

image


To avoid this, they use the Count Sketch projection function (proposed by Charikar et al. (2002) as part of an algorithm for finding the most frequent values in a data stream) together with Fourier transforms. In Count Sketch projection, two vectors are initialized: one with values of -1 and 1, and another that maps an input at index $i$ to an output at index $j$. For each element in the input, its destination index is looked up using the second vector, and the element multiplied by its sign from the first vector is added to the output at that index. The Convolution Theorem states that a convolution between two vectors in the time domain is the same as an element-wise product in the frequency domain, which can be obtained with the Fourier transform. They use this property of convolutions to finally get their joint representation. Hedi et al. (2017) also use multimodal compact bilinear pooling in their implementation of MUTAN for VQA.
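The two ingredients, Count Sketch and the convolution theorem, fit in a short NumPy sketch. Vector sizes, the random hash and sign vectors, and the function names are assumptions of this example; the final loop is only there to check that the FFT route equals a Count Sketch of the full outer product.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dimensions: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)   # np.add.at accumulates correctly when indices repeat
    return out

def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Circular convolution of the two sketches, computed as a product in the frequency domain."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)

rng = np.random.default_rng(0)
n, d = 6, 16
hx, hy = rng.integers(0, d, size=n), rng.integers(0, d, size=n)            # index maps
sx, sy = rng.choice([-1.0, 1.0], size=n), rng.choice([-1.0, 1.0], size=n)  # sign vectors
x, y = rng.normal(size=n), rng.normal(size=n)                              # toy features

joint = compact_bilinear(x, y, hx, sx, hy, sy, d)

# Reference: sketch the n x n outer product directly, without ever using an FFT.
direct = np.zeros(d)
for i in range(n):
    for j in range(n):
        direct[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * x[i] * y[j]
```

Both routes give the same `d`-dimensional vector, but the FFT route never materializes the outer product, which is what keeps the parameter count manageable for large feature vectors.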

Lu et al. (2016) use a co-attention mechanism before fusing the embeddings, so that visual attention and textual attention are both calculated. They propose two attention mechanisms: parallel co-attention and alternating co-attention.

image

In parallel co-attention, they connect the image and question by calculating the similarity between image and question features at all pairs of image locations and question locations. They call the resulting representation an affinity matrix. They use this affinity matrix to predict the attention maps.

$$ C = \tanh(Q^{T}W_{b}V) $$

Here $C$ is the affinity matrix, $Q$ holds the question features (one column $q_{t}$ per word), and $V$ holds the visual features (one column $v_{n}$ per image region). $W_{b}$ contains the weights.

$$ H_{v} = \tanh(W_{v}V + (W_{q}Q)C) $$

$$ H_{q} = \tanh(W_{q}Q + (W_{v}V)C^{T}) $$

$$ a_{v} = \text{softmax}(w^{T}_{hv} H_{v}) $$

$$ a_{q} = \text{softmax}(w^{T}_{hq} H_{q}) $$

Where $W_{v}$, $W_{q}$, $w_{hv}$, and $ w_{hq}$ are the weight parameters. $ a_{v} $ and $ a_{q} $ are the attention probabilities for each image region $ v_{n} $ and word $ q_{t} $, respectively.

$$ v = \sum_{n=1}^{N} a_{v,n} v_{n}, \quad q = \sum_{t=1}^{T} a_{q,t} q_{t} $$

Where $a_{v,n}$ and $a_{q,t}$ are the entries of $a_{v}$ and $a_{q}$, and $v$ and $q$ are the parallel co-attention vectors for the image and the question, respectively.
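The shapes in these equations are easier to follow in code. The NumPy sketch below evaluates the parallel co-attention formulas with random, untrained weights; the dimensions (`d` features, `k` hidden units, `N` image regions, `T` words) are arbitrary choices for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k, N, T = 8, 5, 4, 3
V = rng.normal(size=(d, N))          # one column per image region
Q = rng.normal(size=(d, T))          # one column per question word
W_b = rng.normal(scale=0.1, size=(d, d))
W_v = rng.normal(scale=0.1, size=(k, d))
W_q = rng.normal(scale=0.1, size=(k, d))
w_hv = rng.normal(size=k)
w_hq = rng.normal(size=k)

C = np.tanh(Q.T @ W_b @ V)                  # affinity matrix, shape (T, N)
H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)      # shape (k, N)
H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)    # shape (k, T)
a_v = softmax(w_hv @ H_v)                   # attention over image regions
a_q = softmax(w_hq @ H_q)                   # attention over question words
v = V @ a_v                                 # attended image vector
q = Q @ a_q                                 # attended question vector
```

Each attention vector sums to one, so `v` and `q` are weighted averages of the region and word features.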

They also propose alternating co-attention, where they sequentially alternate between generating image and question attention. The image features guide the question attention and vice versa.

$$ H = \tanh(W_{x}X + (W_{g}g)1^{T}) $$

$$ a_{x} = \text{softmax}(w^{T}_{hx} H) $$

$$ x = \sum_{i} a_{x,i} x_{i} $$

Where $1$ is a vector with all elements equal to $1$. $ W_{x} $, $ W_{g} $, and $ w_{hx} $ are parameters. $ a_{x} $ holds the attention weights over the features $X$, whose columns are $x_{i}$, and $g$ is the guidance coming from the other modality.
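A single attention step of this kind, applied alternately to the question and the image, might look as follows. This is a sketch with random weights; starting from a zero guidance vector and sharing one set of parameters across all three steps are simplifying assumptions of this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """One attention step over the columns of X, guided by the vector g."""
    ones = np.ones(X.shape[1])
    H = np.tanh(W_x @ X + np.outer(W_g @ g, ones))   # shape (k, number of columns)
    a_x = softmax(w_hx @ H)                          # attention weights
    return X @ a_x, a_x                              # attended vector and weights

rng = np.random.default_rng(0)
d, k, N, T = 8, 5, 4, 3
V = rng.normal(size=(d, N))   # image region features
Q = rng.normal(size=(d, T))   # question word features
W_x = rng.normal(scale=0.1, size=(k, d))
W_g = rng.normal(scale=0.1, size=(k, d))
w_hx = rng.normal(size=k)

s, _ = attend(Q, np.zeros(d), W_x, W_g, w_hx)   # summarize the question without guidance
v, a_v = attend(V, s, W_x, W_g, w_hx)           # question summary guides image attention
q, a_q = attend(Q, v, W_x, W_g, w_hx)           # attended image guides question attention
```

The same `attend` function serves both modalities; only the inputs `X` and `g` change between steps.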

In a very clever fashion, Kim et al. (2016) use residual connections for joint feature representations, combining features extracted from a CNN-based architecture for image featurization and an LSTM architecture for question featurization. The Multimodal Residual Network architecture looks something like this.

image

The residual connections add an attention capability to the network, which the authors have visualized as well.

Gao et al. (2018) realized that a lot of spatial information is lost when we take only the one-dimensional vector representation from the second to last layer of some convolutional networks, like ResNet. To solve this, they use what they call “question-guided hybrid convolutions”, where they create convolutional kernels from question representations generated by a GRU and apply them to the ResNet features to create a joint feature representation.

image

The authors point out that to predict a commonly used $3 \times 3 \times 256 \times 256$ kernel from a 2000-D question feature vector, the fully-connected layer for learning the mapping generates 117 million parameters, which is difficult to learn and causes overfitting on existing VQA datasets. To tackle that, they try to predict group convolutional kernel parameters instead. The QGHC module looks something like this.

image

Peng et al. (2019) proposed that the full feature vector generated by an LSTM is not needed. Only keywords need to be extracted from the question. They use this idea to propose a Word-to-Region Attention Network (WRAN), which can locate relevant object regions and identify the corresponding words in the reference question.

Answer Generation

The research on VQA includes:

  • Free-form, open-ended questions where the answer could be words, phrases, and even complete sentences
  • Object counting questions, where the answer involves counting the number of objects in an image
  • Multiple choice questions
  • Binary questions (yes/no)

image

Binary questions and multiple choice questions often use a sigmoid layer at the end. The joint representations are passed through one or two fully-connected layers. The output is passed through a single neuron layer which functions as the classification layer.

For multiple choice questions, the answer choices are encoded using some embedding generation mechanism like the ones we discussed earlier. This is fed into the mechanism of joint feature generation. The joint feature is then passed through a fully-connected layer, and finally a multiclass classification layer with a softmax activation function.
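The two kinds of answer heads can be sketched with NumPy as follows. The layer sizes, the random weights, and the choice of four answer options are all assumptions of this example; a trained model would learn these weights from data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
joint = rng.normal(size=16)                  # stand-in for a joint image-question feature

# Shared fully-connected layer with a ReLU activation.
W_fc = rng.normal(scale=0.1, size=(8, 16))
hidden = np.maximum(W_fc @ joint, 0.0)

# Binary (yes/no) head: a single neuron followed by a sigmoid.
w_bin = rng.normal(scale=0.1, size=8)
p_yes = sigmoid(w_bin @ hidden)

# Multiple choice head: one logit per answer option followed by a softmax.
W_mc = rng.normal(scale=0.1, size=(4, 8))
choice_probs = softmax(W_mc @ hidden)
predicted_choice = int(np.argmax(choice_probs))
```

The sigmoid head outputs a single probability for "yes", while the softmax head outputs a distribution over the answer options from which the most likely one is picked.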

For free-form, open-ended questions, the joint feature representations are usually converted into answers using a recurrent network like an LSTM. Wu et al. (2016) extract data about the image to provide the language model with more context. They use the Doc2Vec algorithm to get embeddings, which are used along with an LSTM to generate answers. Malinowski et al. (2017) devise a method in which the image feature is fed along with each word’s representation as encoded by an LSTM.

Ruwa et al. (2018) use not just image and question features, but also take mood embeddings into account. A CNN-based mood detector is trained concurrently with the LSTM attention model in relation to local regions of an image. The mood relates to the people in the images and their actions.

Datasets

There are a lot of datasets that address different kinds of tasks in the visual question answering domain. Some of the primary ones are discussed here.

  • DAQUAR (Dataset for Question Answering on Real World Images) is a dataset of human question-answer pairs about images.
  • COCO-QA is an extension of the COCO (Common Objects in Context) dataset. The questions are of four different types: object, number, color, and location. All answers are single words.
  • The VQA dataset is larger than the other datasets. In addition to the 204,721 images from the COCO dataset, it includes 50,000 abstract cartoon images. There are three questions per image and ten answers per question. The dataset includes multiple choice answers, as well as open-ended answers.
  • Visual Madlibs has over 10,000 images, with 12 types of fill-in-the-blank templates in the dataset. It provides two evaluation methods: multiple choice questions and fill in the blanks.
  • The Visual7W dataset contains seven types of questions: what, where, when, who, why, how, and which. The dataset was collected on 47,300 COCO images. In total, it has 327,939 QA pairs, together with 1,311,756 human-generated multiple-choices, and 561,459 object groundings from 36,579 categories. In addition, they provide complete grounding annotations that link the object mentions in the QA sentences to their bounding boxes in the images, and thus introduce a new QA type with image regions as the visually grounded answers. They also provide a toolkit for developers and AI researchers.
  • The CLEVR dataset consists of a training set of 70,000 images and 699,989 questions, a validation set of 15,000 images and 149,991 questions, a test set of 15,000 images and 14,988 questions, as well as answers for all train and validation questions. Besides that, they also provide scene graph annotations for train and val images giving ground truth locations, attributes, and relationships for objects.
  • Visual Genome has 1.7 million visual question answers in its dataset. Besides VQA, they provide a lot of other kinds of data as well: region descriptions, object instances, and all the annotations mapped to WordNet synsets.

Evaluation Metrics

In multiple choice tasks, simple accuracy is enough to evaluate a given VQA model. But for open-ended VQA tasks, requiring an exact string match would be a very rigid way to evaluate the performance of the VQA model.

Wu and Palmer Similarity

Wu and Palmer similarity measures how similar two concepts are based on their positions in a taxonomy. The score takes into account the positions of concepts $ c_{1} $ and $ c_{2} $ in the taxonomy, relative to the position of their Least Common Subsumer, $ LCS(c_{1}, c_{2}) $. It assumes that the similarity between two concepts is a function of path length and depth, as in other path-based measures.

The Least Common Subsumer of two nodes $v$ and $w$ in a tree or directed acyclic graph (DAG) is the deepest node that has both $v$ and $w$ as descendants, where we define each node to be a descendant of itself. So if $v$ has a direct connection from $w$, $w$ is the lowest common ancestor.

$$ Sim_{wup}(c_{1}, c_{2}) = 2 \times \frac{Depth(LCS(c_{1}, c_{2}))}{Depth(c_{1}) + Depth(c_{2})} $$

$ LCS(c_{1}, c_{2}) $ = lowest node in the hierarchy that is a hypernym of both $ c_{1} $ and $ c_{2} $.

NLTK has a function that implements WUP similarity.

WUP similarity works for single-word answers, but doesn’t work for phrases or sentences.
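To see the formula at work without downloading a lexical database, the sketch below computes WUP similarity on a tiny hand-made taxonomy. The taxonomy, the convention that the root has depth 1, and the function names are assumptions of this example; NLTK's WordNet interface offers the same measure on real synsets.

```python
# A toy taxonomy, stored as child -> parent links (invented for the example).
parent = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}

def path_to_root(node):
    """Return the node and all its ancestors, starting at the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def depth(node):
    return len(path_to_root(node))   # the root has depth 1

def lcs(c1, c2):
    """Least common subsumer: the deepest node that has both concepts as descendants."""
    ancestors = set(path_to_root(c1))
    for node in path_to_root(c2):    # walks upward, so the first hit is the deepest
        if node in ancestors:
            return node

def wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

For instance, `wup("dog", "cat")` is higher than `wup("dog", "car")` because dogs and cats share the deeper subsumer "animal", while dogs and cars only meet at the root.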

BLEU (Bilingual Evaluation Understudy)

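This approach works by counting the n-grams in the candidate translation that match n-grams in the reference text. The comparison is made regardless of where in the text each n-gram appears.

$$ P_{n} = \frac{\sum_{\text{n-gram}} count_{clip}(\text{n-gram})}{\sum_{\text{n-gram}} count(\text{n-gram})} $$

Here the sums run over the n-grams of the candidate, and $count_{clip}$ caps the count of each n-gram at the number of times it occurs in the reference.

The score also has a brevity penalty attached to it, which is defined as follows.

$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases} $$

Where $r$ is the length of the reference answer and $c$ is the length of the prediction.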

$$ BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} W_{n} \log P_{n} \right) $$
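A compact pure-Python version of this computation is sketched below. It assumes a single reference, uniform weights $W_{n} = 1/N$, and the standard brevity penalty (1 when the candidate is longer than the reference, $e^{1 - r/c}$ otherwise); the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision P_n."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                      # log(0) is undefined, so the score collapses to 0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
score = bleu(candidate, reference)
```

In this example five of six unigrams and three of five bigrams match, so with `max_n=2` the score is the geometric mean of 5/6 and 3/5. Clipping prevents a degenerate candidate such as "the the the" from scoring a perfect unigram precision.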

A comprehensive guide to the BLEU metric can be found in this article.

BLEU has some drawbacks, too. To a great extent, the BLEU score is based on very simplistic text string matches. Very roughly, the larger the cluster of words that you can match exactly, the higher the BLEU score. This makes it not the best metric if the answers are long and can go beyond small phrases. The NLTK implementation tutorial can be found here.

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

The METEOR metric has two parts to its calculation. First, it calculates the precision and recall over unigrams. Then it takes a weighted harmonic mean of the two, in which recall is weighted nine times more than precision.

$$ F_{mean} = \frac{10PR}{R + 9P} $$

Precision $P$ and recall $R$ are calculated based on unigram matches.

To take into account longer matches, METEOR computes a penalty for a given alignment as follows. First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks, such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation.

The second part is a penalty function that is formulated as follows:

$$ Penalty = 0.5 \times \left(\frac{\text{number of chunks}}{\text{number of matched unigrams}}\right)^{3} $$

Finally, the score is:

$$ Score = F_{mean} \times (1 - Penalty) $$
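The calculation can be sketched in pure Python. This simplified version matches exact words only (no stemming or synonyms), aligns each candidate word greedily to the first unused identical reference word rather than searching for the best alignment, and uses the recall-weighted mean $10PR / (R + 9P)$ from the original METEOR formulation.

```python
def meteor(candidate, reference):
    """Simplified METEOR: exact unigram matches with a greedy left-to-right alignment."""
    used = set()
    alignment = []                       # pairs of (candidate index, reference index)
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(candidate)
    recall = matches / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A new chunk starts whenever consecutive matches are not adjacent in both texts.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1

    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

reference = "the cat sat on the mat".split()
scrambled = "on the mat sat the cat".split()
```

A candidate identical to the reference forms a single chunk and scores close to 1, while `scrambled` contains all the right words but in six separate chunks, so the penalty reaches its maximum of 0.5 and halves the score.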

Conclusion

Here we’ve covered a survey of the progress in the field of visual question answering. We saw that the problem is divided into four main areas of research: image featurization, question featurization, joint feature representation, and answer generation. After reviewing each, we saw an overview of several different approaches that many researchers have used to tackle these problems in recent years. We also took a look at the major datasets and evaluation metrics for the task of visual question answering. I hope you found the article useful.
