The age of transformers.

Understanding the architecture.

A transformer is a neural network built around a self-attention mechanism (typically multi-head attention) that forms context-aware token representations by weighting information from surrounding tokens, enabling the model to capture long-range dependencies. The seminal transformer paper “Attention Is All You Need” marked a significant departure from previous recurrent architectures in natural language processing (NLP). Its key innovations revolutionized how sequence-to-sequence tasks were approached, leading to substantial improvements in efficiency and performance. In this article, I will get into the details of how transformers comprehend user input and utilize self-attention, a key concept that allows transformers to understand the relationships between words.

Input Embeddings

The first step in a transformer model is to convert input tokens (typically words or subwords) into vectors. This process is achieved through input embeddings.

Tokenization:

The input text is split into tokens. Depending on the model and task, these tokens might be words, subwords, or characters.

Embedding Layer:

Each token is then mapped to a high-dimensional vector via an embedding layer. This embedding layer is learned during the training process and enables the model to capture semantic information about each token.

Role of Embeddings:

These embeddings act as the initial representations of the tokens, capturing their meanings in a format that the neural network can process. They are dense vectors that each token gets converted into, serving as the starting point for further processing by the transformer.

I will skip the details on embeddings here, as I covered them in detail in one of my previous posts, Language Models - The Basics.
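As a minimal sketch of the tokenization and embedding-lookup steps described above (the toy vocabulary and random embedding table are purely illustrative; in a real model the table is learned during training):

```python
import numpy as np

# Hypothetical toy vocabulary; real models use learned subword vocabularies.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4

# Embedding table: one dense vector per token. Random here for illustration;
# in practice these values are learned parameters.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of tokens to their embedding vectors."""
    ids = [vocab[t] for t in tokens]          # tokenization -> token ids
    return embedding_table[ids]               # lookup -> (len(tokens), d_model)

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 4)
```

These vectors are the initial token representations that the rest of the transformer operates on.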

Positional Encoding

Positional encodings in transformers are a crucial component, as they provide information about the position of each token in the sequence. Unlike recurrent neural networks, transformers do not have any inherent sense of order or sequence in their architecture. Positional encodings are added to the input embeddings to compensate for this and enable the model to understand the order of words in a sentence.

In the original Transformer model, the authors considered only fixed (non-trainable) positional encodings. The positional encodings had the same dimension as the embeddings and were added to them at the bottom of the encoder and decoder stacks. The original positional encoding uses a sinusoidal function, which can be described using the following equations:

\[\begin{align} PE_{(pos,2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\ PE_{(pos,2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \end{align}\]

As you can see, these values depend on $d_{\text{model}}$ (the embedding dimension) and $i$ (the index within the position vector). In general, these fixed positional encodings have some notable disadvantages.
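The sinusoidal equations above translate directly into code. This is a sketch of the fixed encoding (for simplicity it assumes an even $d_{\text{model}}$):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings from 'Attention Is All You Need'.

    Assumes d_model is even. Returns an array of shape (seq_len, d_model).
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model // 2)
    angle = pos / (10000 ** (2 * i / d_model))    # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

These encodings would simply be added element-wise to the input embeddings before the first attention layer.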

Attention Mechanism as the Core Component

Attention is the mechanism in the transformer that weighs and combines the representations from appropriate other tokens in the context at layer $k-1$ to build the representation for tokens in layer $k$. At the core of an attention-based approach is the ability to compare an item of interest to a collection of other items in a way that reveals their relevance in the current context. It takes an input representation $x_i$ corresponding to the input token at position $i$, and a context window of prior inputs ($x_1,\ldots,x_{i-1}$), and produces an output $a_i$.

For pedagogical purposes, let’s first describe a simplified intuition of attention: the attention output $a_i$ at token position $i$ is a weighted sum of the representations $x_j$ for all $j \le i$. We use $\alpha_{ij}$ to denote how much $x_j$ contributes to $a_i$.

\[a_i = \sum_{j \le i} \alpha_{ij}\, x_j\]

In the case of self-attention, the set of comparisons is to other elements within a given sequence. The result of these comparisons is then used to compute an output for the current input.

Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors emerges. Let’s call the input vectors $x_1, x_2, \ldots, x_t$ and the corresponding output vectors $y_1, y_2, \ldots, y_t$. The vectors all have dimension $k$. To produce the output vector $y_i$, the self-attention operation takes a weighted average over all the input vectors.

\[\begin{align} y_i = \sum_j w_{ij} x_j \end{align}\]

The weight $w_{ij}$ is not a parameter, as in a normal neural net; instead, it is derived from a function over $x_i$ and $x_j$. The simplest option for this function is the dot product. (Note that $x_i$ is the input vector at the same position as the current output vector $y_i$. For the next output vector, we get an entirely new series of dot products and a different weighted sum.)

\[\begin{align} w_{ij}' = x_i^T x_j \end{align}\]

The dot product can yield a value anywhere from negative to positive infinity. To confine these values within the [0,1] interval and to guarantee that the total across the entire sequence equals 1, a softmax function is applied:

\[\begin{align} w_{ij} = \frac{\exp(w_{ij}')}{\sum_j \exp(w_{ij}')} \end{align}\]
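The two equations above, dot-product scores followed by a softmax, can be combined into a short sketch of this basic self-attention:

```python
import numpy as np

def simple_self_attention(x):
    """Basic self-attention over a sequence x of shape (t, k).

    Scores are raw dot products w'_{ij} = x_i . x_j, normalized with a
    row-wise softmax, then used to form weighted sums y_i = sum_j w_ij x_j.
    """
    scores = x @ x.T                                # (t, t) dot-product scores
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax: rows sum to 1
    return weights @ x                              # weighted average of inputs

x = np.random.default_rng(0).normal(size=(5, 8))    # 5 tokens, dimension 8
y = simple_self_attention(x)
print(y.shape)  # (5, 8)
```

Note that this sketch attends over the whole sequence; a decoder-style (causal) variant would additionally mask out positions $j > i$.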

And that is the basic idea of self-attention. This kind of simple attention can be useful. But transformers allow us to create a more sophisticated way of representing how words can contribute to the representation of longer inputs. Let’s consider the three different roles that each input embedding plays during the course of the attention process.

To capture these three different roles, transformers introduce weight matrices $\mathbf{W_Q}$, $\mathbf{W_K}$, and $\mathbf{W_V}$. These weights will be used to project each input vector $\mathbf{x_i}$ into a representation of its role as a key, query, or value.

$\mathbf{q_i} = \mathbf{W_Q} \mathbf{x_i}$ ; $\mathbf{k_i} = \mathbf{W_K} \mathbf{x_i}$ ; $\mathbf{v_i} = \mathbf{W_V} \mathbf{x_i}$

Given these projections, the score between a current focus of attention, $x_i$, and an element in the preceding context, $x_j$, consists of a dot product between its query vector $q_i$ and the preceding element’s key vector $k_j$, which essentially gives $w_{ij}'$ from the dot-product equation above.

The dot product may yield large values which, when exponentiated, can cause numerical instability and vanishing gradients during training. To mitigate this, the dot product is scaled by dividing by a factor related to the embedding size; commonly, this is the square root of the key dimension $d_k$.

\[\begin{equation} \text{score}(q, k) = \frac{q \cdot k}{\sqrt{d_k}} \end{equation}\]
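Putting the projections and the scaled scores together, a single attention head can be sketched as follows (the random projection matrices stand in for the learned weights $\mathbf{W_Q}$, $\mathbf{W_K}$, $\mathbf{W_V}$; the code uses row-vector conventions, so projections are `x @ W`):

```python
import numpy as np

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    """Single-head attention: project x into queries, keys, and values,
    score with scaled dot products, softmax, then combine the values."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d_k = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d_k)               # score(q, k) = q.k / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ v                                    # weighted sum of values

rng = np.random.default_rng(0)
t, d_model, d_k = 5, 8, 4
x = rng.normal(size=(t, d_model))                   # 5 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```

In a full transformer this head is replicated (multi-head attention), each head with its own projection matrices, and the heads’ outputs are concatenated and projected back to $d_{\text{model}}$.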