- In a transformer-based encoder-decoder architecture, the model consists of an encoder block and a decoder block
- The encoder block consists of a stack of N=6 identical layers
- The decoder block consists of a stack of N=6 identical layers
- The input sequence is passed through the encoder block to generate a sequence of hidden states, which are then passed to the decoder block to generate the output sequence
- For example, the input sequence is a sentence in English and the output sequence is a sentence in German: “I am a student” -> “Ich bin ein Student”
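To make the overall wiring concrete, here is a minimal sketch using PyTorch's built-in `nn.Transformer` with the paper's configuration (d_model = 512, 6 encoder layers, 6 decoder layers); the random tensors simply stand in for already-embedded source and target sentences and are purely illustrative.
```python
import torch
import torch.nn as nn

# Encoder-decoder wiring in the original configuration:
# 512-dimensional model, 6 encoder layers, 6 decoder layers, 8 attention heads.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Dummy inputs standing in for the embedded English source ("I am a student")
# and German target ("Ich bin ein Student"); shape = (seq_len, batch, d_model).
src = torch.rand(4, 1, 512)
tgt = torch.rand(4, 1, 512)

out = model(src, tgt)   # one 512-dimensional hidden state per target position
print(out.shape)        # torch.Size([4, 1, 512])
```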
Embedding
- Each word in the input and output sequences is represented as a 512-dimensional vector called an embedding.
- The embedding layer maps each word to its corresponding embedding vector.
- Before training begins, each word is assigned a random embedding vector.
- These are small random values obtained from a normal distribution.
- The model captures patterns and dependencies between words by continuously updating these embeddings during training.
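A minimal sketch of this lookup, assuming a toy four-word vocabulary and NumPy in place of a real embedding layer (names and sizes other than the 512 dimensions are illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "am": 1, "a": 2, "student": 3}   # toy vocabulary (assumed)
d_model = 512

# One 512-dimensional row per word, initialized with small values drawn from
# a normal distribution; these rows are updated continuously during training.
embedding_table = rng.normal(0.0, 0.02, size=(len(vocab), d_model))

# The embedding layer is effectively a row lookup by word index.
tokens = ["I", "am", "a", "student"]
embeddings = embedding_table[[vocab[t] for t in tokens]]
print(embeddings.shape)   # (4, 512)
```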
Positional Encoding
- Positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
- Unlike humans, who naturally read from left to right, the transformer processes all words in parallel, so it needs an explicit way to understand that “word 1 comes before word 2.”
- The positional encoding in the original transformer is implemented using sine and cosine functions of different frequencies (see the sketch below).
- Using sine waves makes it easier for the transformer to understand both nearby and far-apart relationships between words. (It’s similar to how music uses different frequencies to create unique sounds.)
- The positional encoding vectors have the same dimensions as the embedding vectors and are added element-wise to create the input representation for each word.
- This allows the model to differentiate between words based on their position in the sequence.
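A minimal NumPy sketch of the sinusoidal encoding (sine on even dimensions, cosine on odd dimensions, following the original paper); it produces vectors with the same shape as the embeddings so the two can be added element-wise:
```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                    # word positions
    i = np.arange(d_model // 2)[None, :]                 # dimension pair index
    angles = pos / np.power(10000, (2 * i) / d_model)    # lower frequency as i grows

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=512)
# input_representation = embeddings + pe   # element-wise addition per word
print(pe.shape)   # (4, 512), same shape as the embeddings
```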
- Before diving into multi-head attention, it is important to understand the self-attention mechanism, which forms its foundation.
- In the transformer, the self-attention mechanism is applied to each word in the sequence.
Self-Attention
- Self-attention is a mechanism that lets the model weigh the importance of the other words in the sequence when computing the representation of each word, allowing it to focus on the most relevant parts of the input.
- The self-attention mechanism is applied to each word in the input sequence.
The self-attention mechanism has the following steps:
- Linear Transformation:
- A linear transformation is applied to the input representation (obtained from the Embedding and Positional encoding) to get query vector (Q), key vector (K) and value vector (V) for each word.
- Query vectors express what each word is looking for in the rest of the sequence.
- Key vectors represent what each word offers to be matched against; comparing queries with keys reveals the dependencies and connections between words.
- Value vectors contain the content of each word that is carried forward once the attention weights are computed.
Given an input sequence, \[
\begin{align*}
\text{X} & = [x_1, x_2, x_3, \ldots, x_n] \\
\end{align*}
\]
The linear transformations are expressed as:
\[
\begin{align*}
\text{Q} & = \text{X} \cdot \text{W}^Q \\
\text{K} & = \text{X} \cdot \text{W}^K \\
\text{V} & = \text{X} \cdot \text{W}^V \\
\end{align*}
\]
where, \(\text{W}^Q\), \(\text{W}^K\) and \(\text{W}^V\) are the weight matrices for the query, key and value vectors respectively. Q, K and V are the query, key and value matrices respectively.
- The same linear transformation is applied to all words in the input sequence.
- Through the linear transformation, the input word embeddings are mapped to three different representations: Query, Key and Value (see the sketch below).
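A minimal sketch of the three projections in NumPy, assuming a 4-word sequence, 512-dimensional inputs and 64-dimensional Q/K/V vectors (the weight matrices are random stand-ins for learned ones):
```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 512, 64                    # 4 words, 512-dim inputs, 64-dim Q/K/V

X = rng.normal(size=(n, d_model))               # embeddings + positional encodings
W_Q = 0.02 * rng.normal(size=(d_model, d_k))    # learned in practice; random here
W_K = 0.02 * rng.normal(size=(d_model, d_k))
W_V = 0.02 * rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries: what each word is looking for
K = X @ W_K   # keys:    what each word offers to be matched against
V = X @ W_V   # values:  the content carried forward to the output
print(Q.shape, K.shape, V.shape)   # (4, 64) (4, 64) (4, 64)
```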
- Scaled Dot-Product Attention: After the linear transformation, the model computes attention scores by taking the dot product of each query vector with each key vector, scaling the scores, and applying a softmax function to obtain the attention weights.
- Dot-product
Given the set of query vectors,
\[
\begin{align*}
\text{Q} & = [q_1, q_2, q_3, \ldots, q_n] \\
\end{align*}
\]
Given the set of key vectors,
\[
\begin{align*}
\text{K} & = [k_1, k_2, k_3, \ldots, k_n] \\
\end{align*}
\]
The attention score matrix \(\text{A}\) is a matrix where each entry \(A_{ij}\) is the dot product of the i-th query and the j-th key.
\[
\begin{align*}
A_{ij} & = q_i \cdot k_j \\
\end{align*}
\]
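Because \(A_{ij} = q_i \cdot k_j\) for every pair, the whole score matrix is a single matrix product. A self-contained NumPy sketch with random stand-in vectors:
```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))   # toy query vectors, one per word
K = rng.normal(size=(4, 64))   # toy key vectors, one per word

A = Q @ K.T                    # A[i, j] = dot(q_i, k_j)
print(A.shape)                 # (4, 4): every word scored against every word
```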
- Scaling
- The dot products computed above can become very large when the dimension of the key vectors is large.
- Large values cause problems in the softmax: it saturates, producing attention weights that are nearly one-hot and gradients that are vanishingly small, which slows training.
- So, the dot-products are scaled by the square root of the dimension of the key vectors (\(d_{k}\)).
\[
\begin{align*}
A_{ij} & = \frac{q_i \cdot k_j}{\sqrt{d_{k}}} \\
\end{align*}
\]
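The sketch below shows why the scaling matters, under the assumption of random query and key components with unit variance: the raw dot products spread out like \(\sqrt{d_k}\), while dividing by \(\sqrt{d_k}\) keeps the scores in a stable range.
```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(100_000, d_k))   # many random unit-variance query vectors
k = rng.normal(size=(100_000, d_k))   # many random unit-variance key vectors

raw = np.sum(q * k, axis=1)           # unscaled dot products
scaled = raw / np.sqrt(d_k)           # scaled dot products

print(round(raw.std(), 1))      # ~8.0, grows with sqrt(d_k)
print(round(scaled.std(), 1))   # ~1.0, independent of d_k
```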
- Softmax
- The scaled dot-products are passed through a softmax function to get the attention weights.
\[
\begin{align*}
\alpha_{ij} & = \frac{\exp(A_{ij})}{\sum_{k=1}^{n} \exp(A_{ik})}
\end{align*}
\]
where,
- \(\alpha_{ij}\) is the attention weight for the i-th query and j-th key.
- \(\exp\) is the exponential function with base \(e \approx 2.71828\).
- The softmax function is applied to each row of the attention score matrix \(\text{A}\).
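A minimal row-wise softmax in NumPy; the row maximum is subtracted before exponentiating, a standard trick that keeps the computation numerically stable without changing the resulting weights:
```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)   # stability trick; result is unchanged
    e = np.exp(A)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))           # toy scaled attention scores
weights = softmax_rows(scores)
print(weights.sum(axis=1))                 # each row of attention weights sums to 1
```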
- Weighted Sum
- For each word, the value vectors are scaled by their corresponding attention weights and summed.
- This weighted sum is the output of the self-attention mechanism for that word.
\[
\begin{align*}
\text{o}_i & = \sum_{j=1}^{n} \alpha_{ij} \cdot v_j
\end{align*}
\]
where,
- \(\text{o}_i\) is the output of the self-attention mechanism for the i-th word.
- \(\alpha_{ij}\) is the attention weight for the i-th query and j-th key.
- \(v_j\) is the value vector for the j-th key.
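Putting the four steps together, a minimal single-head self-attention sketch in NumPy (inputs and weights are random stand-ins for learned ones):
```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # 1. linear transformations
    A = (Q @ K.T) / np.sqrt(K.shape[-1])             # 2. scaled dot-product scores
    A = A - A.max(axis=1, keepdims=True)             #    (numerical stability)
    weights = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # 3. softmax rows
    return weights @ V                               # 4. weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 512, 64
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (0.02 * rng.normal(size=(d_model, d_k)) for _ in range(3))

O = self_attention(X, W_Q, W_K, W_V)
print(O.shape)   # (4, 64): one output vector per word
```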
Multi-Head Attention
- In multi-head attention, the self-attention mechanism is applied multiple times in parallel to the input sequence, each time with its own learned weight matrices \(\text{W}^Q\), \(\text{W}^K\) and \(\text{W}^V\).
- “Head” refers to an individual component in the multi-head self-attention mechanism that independently learns different self-attention patterns.
- This allows the model to focus on multiple parts of the input sequence in parallel, thereby helping it capture the full context.
- This also makes the model more computationally efficient as it enables parallel processing across different heads.
The outputs from all the attention heads are concatenated and passed through a linear layer:
\[
\begin{align*}
\text{O} & = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) \cdot \text{W}^O \\
\end{align*}
\]
where, \(\text{W}^O\) is the weight matrix for the output of the multi-head attention mechanism.
- The output is then passed to a feed-forward neural network.
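A minimal NumPy sketch of multi-head attention, assuming h = 8 heads of size d_k = 64: each head runs the single-head mechanism above with its own random stand-in weights, the head outputs are concatenated, and the result is projected with \(\text{W}^O\):
```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h                         # 64 dimensions per head

def one_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = (Q @ K.T) / np.sqrt(d_k)
    w = np.exp(A - A.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V   # (n, d_k)

X = rng.normal(size=(n, d_model))
heads = []
for _ in range(h):                         # each head has its own projections
    W_Q, W_K, W_V = (0.02 * rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(one_head(X, W_Q, W_K, W_V))

W_O = 0.02 * rng.normal(size=(h * d_k, d_model))    # output projection
O = np.concatenate(heads, axis=-1) @ W_O            # Concat(head_1..head_h) · W^O
print(O.shape)   # (4, 512): back to d_model, ready for the feed-forward layer
```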