- A transformer-based encoder-decoder architecture consists of an encoder block and a decoder block
- The encoder block consists of a stack of N=6 identical layers
- The decoder block consists of a stack of N=6 identical layers
- The input sequence is passed through the encoder block to generate a sequence of hidden states, which are then passed to the decoder block, which generates the output sequence
- For example, the input sequence is a sentence in English and the output sequence is its German translation: “I am a student” -> “Ich bin ein Student” (a minimal end-to-end sketch follows below)
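As a rough sketch (not part of the original notes), the end-to-end encoder-decoder flow can be exercised with PyTorch's built-in `nn.Transformer`, whose defaults match the N=6 layers and 512-dimensional representations described above; the toy tensors and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# 6-layer encoder, 6-layer decoder, 512-dimensional representations.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# Toy tensors standing in for the embedded source (English) and target (German)
# sequences, shaped (sequence_length, batch_size, d_model).
src = torch.rand(4, 1, 512)   # e.g. "I am a student"      -> 4 tokens
tgt = torch.rand(4, 1, 512)   # e.g. "Ich bin ein Student" -> 4 tokens

out = model(src, tgt)         # (4, 1, 512): one hidden state per target position
```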
Embedding
- Each word in the input and output sequences is represented as a 512-dimensional vector called an embedding.
- The embedding layer maps each word to its corresponding embedding vector.
- Before training begins, each word is assigned a random embedding vector.
- These are small random values obtained from a normal distribution.
- The model tries to capture the patterns and dependencies between words by continuously updating these embeddings during training.
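As a minimal sketch (assuming PyTorch; the vocabulary size and token ids below are made up), the embedding layer is just a trainable lookup table:

```python
import torch
import torch.nn as nn

vocab_size = 10_000   # hypothetical vocabulary size
d_model = 512         # embedding dimension used in the original transformer

# A lookup table of shape (vocab_size, d_model); its rows start as random
# values drawn from a normal distribution and are updated during training.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 72, 9, 431]])   # e.g. "I am a student" as ids
word_vectors = embedding(token_ids)           # shape: (1, 4, 512)
```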
Positional Encoding
- Positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
- Because the transformer processes all the words in a sequence at once (rather than reading them left to right the way humans do), it needs a special way to understand that “word 1 comes before word 2.”
- The positional encoding in the original transformer is implemented using sine and cosine functions of different frequencies.
- Using sine waves makes it easier for the transformer to understand both nearby and far-apart relationships between words. (It’s similar to how music uses different frequencies to create unique sounds.)
- The positional encoding vectors have the same dimensions as the embedding vectors and are added element-wise to create the input representation for each word.
- This allows the model to differentiate between words based on their position in the sequence.
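A minimal sketch of the sinusoidal positional encoding, assuming PyTorch (the function name and the sequence length of 4 are illustrative):

```python
import torch

def positional_encoding(seq_len, d_model=512):
    # pos is the word position, i the dimension index; each dimension pair
    # uses a sine/cosine of a different frequency.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # (d_model/2,)
    angle = pos / (10000 ** (i / d_model))                          # (seq_len, d_model/2)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions
    return pe

# Added element-wise to the 512-dimensional word embeddings.
word_vectors = torch.rand(4, 512)                 # 4 embedded words
inputs = word_vectors + positional_encoding(4)    # position-aware representations
```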
- Before diving into multi-head attention, it is important to understand the self-attention mechanism, which forms the foundation of multi-head attention.
- In the transformer implemented here, the self-attention mechanism is applied to each word in the sequence.
Self-Attention
- Self-attention is a mechanism that helps the model weigh the importance of different words in the input sequence, allowing it to focus on the most relevant parts of the sequence when generating each output word.
- The self-attention mechanism is applied to each word in the input sequence.
The self-attention mechanism has the following steps:
- Linear Transformation:
- A linear transformation is applied to the input representation (obtained from the Embedding and Positional Encoding steps) to get a query vector (Q), a key vector (K) and a value vector (V) for each word.
- Query vectors express what the model is currently looking for in the input sequence.
- Key vectors provide information about inter-word dependencies and connections between words.
- Value vectors carry the content of each word in the input sequence that is passed on to the output.
Given an input sequence, $X = [x_1, x_2, x_3, \ldots, x_n]$, the linear transformations are expressed as:

$$Q = X \cdot W^Q, \qquad K = X \cdot W^K, \qquad V = X \cdot W^V$$

where $W^Q$, $W^K$ and $W^V$ are the weight matrices for the query, key and value vectors respectively, and $Q$, $K$ and $V$ are the query, key and value matrices respectively.
- The same linear transformation is applied to all words in the input sequence.
- Through the linear transformation, the input word embeddings are mapped to three different contexts: Query, Key and Value (see the sketch below).
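A minimal sketch of the three projections, assuming PyTorch; the choice of $d_k = 64$ follows the per-head dimension of the original paper, and the weight-matrix names mirror the formulas above:

```python
import torch
import torch.nn as nn

d_model, d_k = 512, 64   # input dimension; query/key/value dimension (per head)

# One learned weight matrix per context: Query, Key and Value.
W_Q = nn.Linear(d_model, d_k, bias=False)
W_K = nn.Linear(d_model, d_k, bias=False)
W_V = nn.Linear(d_model, d_k, bias=False)

X = torch.rand(4, d_model)   # 4 words, each a 512-dim input representation

# The same transformation is applied to every word in the sequence.
Q, K, V = W_Q(X), W_K(X), W_V(X)   # each (4, 64)
```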
- Scaled Dot-Product Attention: After the linear transformation, the model computes attention scores by taking the dot product of each query vector with each key vector, scaling the scores, and applying a softmax function to get the attention weights.
- Dot-product
Given the set of query vectors, $Q = [q_1, q_2, q_3, \ldots, q_n]$, and the set of key vectors, $K = [k_1, k_2, k_3, \ldots, k_n]$, the attention score matrix $A$ is the matrix where each entry $A_{ij}$ is the dot product of the i-th query and the j-th key:

$$A_{ij} = q_i \cdot k_j$$
- Scaling
- The dot products from above can potentially become very large.
- Large values cause issues in the softmax during training: the softmax saturates, and the values might even exceed the representable range of the computer's numerical precision, leading to incorrect outputs.
- So the dot products are scaled by the square root of the dimension of the key vectors ($d_k$):
$$A_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
- Softmax
- The scaled dot-products are passed through a softmax function to get the attention weights.
$$\alpha_{ij} = \frac{\exp(A_{ij})}{\sum_{j'=1}^{n} \exp(A_{ij'})}$$

where,
- $\alpha_{ij}$ is the attention weight for the i-th query and j-th key.
- $\exp$ is the exponential function, $\exp(x) = e^x$ (with $e \approx 2.71828$).
- The softmax function is applied to each row of the attention score matrix A.
- Weighted Sum
- For each word, the weighted sum is computed by scaling each value vector by its corresponding attention weight and summing the scaled vectors.
- The weighted sum is the output of the self-attention mechanism.
$$O_i = \sum_{j=1}^{n} \alpha_{ij} \cdot v_j$$

where,
- $O_i$ is the output of the self-attention mechanism for the i-th word.
- $\alpha_{ij}$ is the attention weight for the i-th query and j-th key.
- $v_j$ is the value vector for the j-th key.
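Putting the steps together, a minimal sketch of scaled dot-product self-attention, assuming PyTorch (the function name and toy shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = K.size(-1)
    # Dot-product: every query against every key -> attention score matrix A.
    A = Q @ K.transpose(-2, -1)        # (n, n)
    # Scaling: divide by sqrt(d_k) to keep the scores in a reasonable range.
    A = A / (d_k ** 0.5)
    # Softmax: each row becomes a set of attention weights that sum to 1.
    alpha = F.softmax(A, dim=-1)       # (n, n)
    # Weighted sum: each output is a weighted sum of the value vectors.
    return alpha @ V                   # (n, d_k)

Q, K, V = torch.rand(4, 64), torch.rand(4, 64), torch.rand(4, 64)
out = self_attention(Q, K, V)          # one 64-dim output per word
```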
Multi-headed attention
- The self-attention mechanism is applied multiple times in parallel to the input sequence.
- “Head” refers to an individual component in the multi-head self-attention mechanism that independently learns different self-attention patterns.
- This allows the model to focus on multiple parts of the input sequence in parallel, thereby allowing it to capture the entire context.
- This also makes the model more computationally efficient as it enables parallel processing across different heads.
The outputs from all the attention heads are concatenated and passed through a linear layer:

$$O = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h) \cdot W^O$$

where $W^O$ is the weight matrix for the output of the multi-head attention mechanism.
- The outputs are then passed to a feed-forward neural network.
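As a sketch of how these pieces are commonly bundled (this uses PyTorch's built-in layer, not the notes' own implementation), `nn.MultiheadAttention` runs the 8 heads in parallel and applies the $W^O$ output projection:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8

# 8 attention heads in parallel, followed by the output projection W_O.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

X = torch.rand(1, 4, d_model)      # (batch, seq_len, d_model)
out, attn_weights = mha(X, X, X)   # self-attention: Q, K and V all come from X

print(out.shape)            # (1, 4, 512) -> passed on to the feed-forward network
print(attn_weights.shape)   # (1, 4, 4)   -> attention weights averaged over heads
```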