\(K\) and \(V\) are the weight matrices for the feed-forward neural network.
\(f\) is the ReLU activation function.
\(x\) is the input to the feed-forward neural network.
Decoder
The decoder takes inputs from both the output sequence (the sequence to be generated) and the output of the last encoder in the encoder stack.
The decoder consists of a stack of N=6 identical blocks.
The decoder block is similar to the encoder block, but with an additional multi-head attention mechanism that allows the decoder to focus on different parts of the input sequence when generating each output word.
Each decoder block consists of a masked multi-head self-attention layer, an encoder-decoder attention layer, and a point-wise feed-forward network layer.
Masked multi-head self-attention layer
The masked self-attention mechanism allows the model to attend only to words that have been generated so far and prevents it from attending to future words that have not yet been predicted.
This is achieved by applying a mask to the attention weights.
A mask matrix \(M\) of the same size as the attention scores is created, containing \(0\) at positions the model is allowed to attend to and \(-\infty\) at future positions.
This mask matrix is added element-wise to the attention scores:
\[
MaskedScores_{ij} = A_{ij} + M_{ij}
\]
where,
\(MaskedScores_{ij}\) is the masked attention score for the i-th query and j-th key.
\(A_{ij}\) is the attention score for the i-th query and j-th key.
\(M_{ij}\) is the corresponding entry of the mask matrix.
The \(MaskedScores_{ij}\) are passed through a softmax function to get the attention weights.
The softmax assigns near-zero weights to the positions with \(-\infty\) values, thereby masking them out.
As a final step, a weighted sum of the value vectors is calculated to get the output of the masked self-attention mechanism.
This output is then passed to the encoder-decoder attention layer.
During inference, the model generates each word one at a time and the masking is applied to prevent attending to future positions.
In other words, each predicted word is conditioned only on the previously generated words; this is referred to as autoregressive generation.
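As a minimal NumPy sketch of this masking step (the function name and shapes are illustrative, not part of the original formulation):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Sketch of masked scaled dot-product self-attention for one decoder head.

    Q, K, V: arrays of shape (seq_len, d_k) -- illustrative shapes.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # attention scores A_ij

    # Mask matrix M: 0 on and below the diagonal, -inf above it,
    # so position i cannot attend to future positions j > i.
    seq_len = scores.shape[0]
    M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    masked_scores = scores + M                    # MaskedScores_ij = A_ij + M_ij

    # Softmax turns the -inf entries into (near-)zero attention weights.
    weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                            # weighted sum of value vectors
```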
Encoder-decoder attention layer
The Encoder-decoder attention layer allows the decoder to combine the encoded sequence of the encoder with the output generated from the multi-head self-attention layer.
At each time step \(t\) in the decoder, the following computations are performed. First, the query vector \(Q_t\) is computed from the current decoder input embedding \(X_t\):
\[
Q_t = X_t \cdot W^Q
\]
where,
\(W^Q\) is the weight matrix for the query vector.
\(X_t\) is the input to the decoder at time step \(t\).
\(Q_t\) is the query vector for the t-th time step.
Next, the keys \(K\) and values \(V\) from the encoder output sequence \(H\) are computed using:
\[
\begin{align*}
K & = H \cdot W^K \\
V & = H \cdot W^V
\end{align*}
\]
where,
\(W^K\) and \(W^V\) are the weight matrices for the keys and values.
\(H\) is the output of the encoder.
The attention scores are calculated by taking the dot product of the query vector \(Q_t\) with the keys \(K\) computed from the encoder output sequence \(H\).
The attention scores are then scaled and passed through a softmax function to get the attention weights.
Lastly, a context vector \(Context_t\) is computed to capture the relevant information from the input sequence that the decoder should focus on when generating the output token at time step \(t\).
It is computed by taking the weighted sum of the value vectors \(V\) using the attention weights:
\[
Context_t = \sum_{j} \alpha_{tj} \, v_j
\]
where,
\(\alpha_{tj}\) is the attention weight for the t-th query and j-th key.
\(v_j\) is the value vector for the j-th key.
This context vector is concatenated with the decoder input embedding \(X_t\) and passed through a linear layer to get the output of the encoder-decoder attention layer.
This output is then passed to the point-wise feed-forward neural network layer.
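A small NumPy sketch of a single-head version of this computation at one time step, using the symbols defined above (the function name and shapes are illustrative):

```python
import numpy as np

def encoder_decoder_attention_step(x_t, H, W_Q, W_K, W_V):
    """Sketch of encoder-decoder attention at a single time step t.

    x_t : decoder input embedding X_t, shape (d_model,)
    H   : encoder output sequence, shape (src_len, d_model)
    W_Q, W_K, W_V : projection weight matrices, shape (d_model, d_k)
    """
    q_t = x_t @ W_Q                     # Q_t = X_t . W^Q
    K = H @ W_K                         # K = H . W^K
    V = H @ W_V                         # V = H . W^V

    d_k = K.shape[-1]
    scores = q_t @ K.T / np.sqrt(d_k)   # scaled attention scores

    # Softmax over the encoder positions gives the attention weights alpha_tj.
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()

    context_t = alpha @ V               # Context_t = sum_j alpha_tj * v_j
    return context_t
```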
Point-wise feed-forward neural network layer
The feed-forward neural network within the decoder operates in the same manner as the one within the encoder.
However, there is a key difference in the input to the feed-forward network in the decoder.
The input to the point-wise feed-forward network comes from the encoder-decoder attention layer.
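As a rough sketch of this layer (the weight names W1, b1, W2, b2 are illustrative; they stand for the two linear transformations around the ReLU activation \(f\) described for the encoder):

```python
import numpy as np

def point_wise_ffn(x, W1, b1, W2, b2):
    """Sketch of the point-wise feed-forward layer, applied to each position independently.

    x : input of shape (seq_len, d_model); in the decoder this comes from the
        encoder-decoder attention layer.
    """
    hidden = np.maximum(0, x @ W1 + b1)   # first linear transformation + ReLU activation f
    return hidden @ W2 + b2               # second linear transformation
```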
Layer normalization
Each sub-layer (masked multi-head self-attention, encoder-decoder attention, and point-wise feed-forward network) is followed by a layer normalization operation and connected with a residual connection.
The layer normalization stabilizes training and enables the model to learn more effectively.
Residual connection
The residual connection is applied around each sub-layer in the decoder.
They are applied around the masked multi-head self-attention layer, encoder-decoder attention layer and point-wise feed-forward neural network layer.
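Taken together, the residual connection and layer normalization around each sub-layer can be summarized as
\[
\text{LayerNorm}\big(x + \text{Sublayer}(x)\big)
\]
where \(\text{Sublayer}(x)\) is the function implemented by the masked multi-head self-attention, encoder-decoder attention, or point-wise feed-forward layer itself.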
Linear or Output layer
The linear or the output layer determines the likelihood of each word being the next word in the output sequence.
The input to this layer is the output from the layer normalization.
The purpose of the linear layer is to map the output from the layer normalization to a vector of raw scores (logits) corresponding to each word being the next word in the output sequence.
This is done by applying a linear transformation, which involves multiplying the output from the layer normalization layer by a weight matrix and adding a bias vector.
Softmax
The output of the linear layer is passed through a softmax function to get the probability distribution over the output vocabulary.
During the training phase, the predicted probabilities are used to compute the cross-entropy loss function, which measures the dissimilarity between the predicted distribution and the true distribution.
During the inference phase, the word with the highest probability at each position is chosen as the predicted output.
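A minimal sketch of the final linear projection, softmax, and greedy selection at inference time (the weight names are illustrative):

```python
import numpy as np

def next_word_distribution(x, W_out, b_out):
    """Sketch of the output layer for a single position.

    x            : decoder output for the current position, shape (d_model,)
    W_out, b_out : output projection weights and bias, mapping d_model to vocab_size.
    """
    logits = x @ W_out + b_out               # raw scores (logits) over the vocabulary
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()              # softmax probability distribution

    predicted_id = int(np.argmax(probs))     # greedy choice at inference time
    return probs, predicted_id
```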