Week 04
Intro to Transformers

LLMs in Linguistic Research WiSe 2024/25

Akhilesh Kakolu Ramarao


30 Oct 2024

Converting words to vectors

  • Machines cannot understand words directly; they can only process numbers.

  • Converting words to vectors means mapping each word to a list of numbers that a model can compute with.

  • Can you think of ways to convert words to vectors?

  • How would you represent the sentence “I am a student” as a vector?

One-hot encoding

  • One-hot encoding is the simplest way to convert words to vectors.
  • Each word is represented as a vector of zeros with a 1 at the index corresponding to the word.
  • For example, the sentence “I am a student” can be represented as a matrix of one-hot encoded vectors.
  • “I” is represented as [1, 0, 0, 0], “am” is represented as [0, 1, 0, 0], and so on.
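
The one-hot scheme above can be sketched in a few lines of Python. This is a minimal illustration, assuming a toy vocabulary that contains only the four words of the example sentence; the names `vocab` and `one_hot` are made up for this sketch.

```python
# Toy vocabulary built from the example sentence (an assumption for illustration).
sentence = "I am a student".split()
vocab = {word: index for index, word in enumerate(sentence)}

def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

# One row per word: the sentence as a matrix of one-hot vectors.
matrix = [one_hot(word, vocab) for word in sentence]
for word, row in zip(sentence, matrix):
    print(word, row)
```

With a realistic vocabulary of tens of thousands of words, each vector would be just as long as the vocabulary, which is one practical reason to look for denser representations.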

Limitations of one-hot encoding

  • One-hot encoding has several limitations:
    • It does not capture the relationships between words.
    • It does not consider the context in which the words appear.
    • It does not account for the similarity between words.

Word embeddings

  • Word embeddings are dense vector representations of words that capture the relationships between words.
  • Word embeddings are learned from large text corpora using neural networks.
  • For example, the word “king” might be represented as [0.2, 0.3, 0.5], and the word “queen” might be represented as [0.1, 0.4, 0.6]. These vectors capture the relationship between the two words.
  • These vectors are learned in such a way that similar words have similar embeddings.
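
One common way to measure how similar two vectors are is cosine similarity. The sketch below compares the one-hot and the dense representation of “king” and “queen”. The dense values are the illustrative numbers from the slide, not real learned embeddings, and the one-hot indices are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors of two different words never overlap (hypothetical indices).
king_onehot, queen_onehot = [1, 0, 0, 0], [0, 1, 0, 0]

# Dense toy embeddings taken from the example above.
king_dense, queen_dense = [0.2, 0.3, 0.5], [0.1, 0.4, 0.6]

print(cosine_similarity(king_onehot, queen_onehot))  # 0.0
print(cosine_similarity(king_dense, queen_dense))    # close to 1
```

Any two different one-hot vectors have a similarity of 0, so they cannot express that “king” and “queen” are related, while dense vectors can.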


  • A transformer is a type of deep learning model that has been widely used in natural language processing tasks.
  • The transformer architecture was introduced in the paper Attention is All You Need by Vaswani et al. (2017).
  • Unlike older recurrent models, transformers process all words of a sentence in parallel rather than one word at a time. This makes training much faster on parallel hardware and helps the model relate words that are far apart.
  • They form the foundation of many powerful language models such as GPT-3, BERT, and T5.
  • They have been used in a wide range of applications such as machine translation, text summarization, and question answering.

Transformer architecture

Image from “Attention is all you need”

  • A transformer-based encoder-decoder architecture consists of an encoder block and a decoder block.
  • The encoder block consists of a stack of N=6 identical layers.
  • The decoder block also consists of a stack of N=6 identical layers.
  • The input sequence is passed through the encoder block to generate a sequence of hidden states, which are then passed to the decoder block to generate the output sequence.
  • For example, the input sequence is a sentence in English and the output sequence is its German translation: “I am a student” -> “Ich bin ein Student”.


  • Each word in the input and output sequences is represented as a 512-dimensional vector called an embedding.
  • The embedding layer maps each word to its corresponding embedding vector.
  • Before training begins, each word is assigned a random embedding vector.
  • These are small random values obtained from a normal distribution.
  • The model tries to capture the patterns and dependencies between words by continuously updating these embeddings during training.
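
The embedding layer can be pictured as a lookup table with one row per word. The following sketch shows only the random initialisation and the lookup, not the training. The toy vocabulary and the standard deviation of 0.02 are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

d_model = 512  # embedding size used in the original transformer
vocab = {"I": 0, "am": 1, "a": 2, "student": 3}  # toy vocabulary (assumption)

# One row of small random values per word, drawn from a normal distribution.
# The standard deviation 0.02 is an illustrative choice, not a value from the slides.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab
]

def embed(words):
    """Look up the embedding vector of each word."""
    return [embedding_table[vocab[word]] for word in words]

vectors = embed("I am a student".split())
print(len(vectors), len(vectors[0]))  # 4 512
```

During training, the rows of `embedding_table` would be updated together with all other weights of the model.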

Positional Encoding

  • Positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
  • Because the transformer processes all words at once, it has no built-in notion of word order. Unlike humans, who naturally read from left to right, it needs an explicit signal to understand that “word 1 comes before word 2.”
  • The positional encoding in the original transformer is implemented using sine and cosine functions of different frequencies.
  • Using waves of different frequencies makes it easier for the transformer to pick up both nearby and far-apart relationships between words. (It’s similar to how music uses different frequencies to create unique sounds.)
  • The positional encoding vectors have the same dimensions as the embedding vectors and are added element-wise to create the input representation for each word.
  • This allows the model to differentiate between words based on their position in the sequence.
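
The sinusoidal encoding from Vaswani et al. (2017) uses \(\sin(pos / 10000^{2i/d})\) on the even dimensions and \(\cos(pos / 10000^{2i/d})\) on the odd dimensions, where \(pos\) is the position of the word and \(d\) the embedding size. Below is a minimal sketch of it. The function names are made up for this example, and the small size `d_model=8` is chosen only to keep the output readable.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even, cosine on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((dim - dim % 2) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings):
    """Add the positional encoding element-wise to a list of embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = positional_encoding(seq_len=4, d_model=8)
print(pe[0])  # position 0: sin(0) = 0 on even dimensions, cos(0) = 1 on odd ones
```

Every position gets a different vector, so the same word at two different positions ends up with two different input representations.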

  • Before diving into multi-head attention, it is important to understand the self-attention mechanism, which forms its foundation.
  • In the transformer, the self-attention mechanism is applied to every word in the sequence.


  • Self-attention is a mechanism that lets the model weigh the importance of every word in the input sequence when building the representation of each word.
  • In other words, it allows the model to focus on different parts of the input sequence for each word it processes.
  • The self-attention mechanism is applied to each word in the input sequence.

The self-attention mechanism has the following steps:

  1. Linear Transformation:
    • A linear transformation is applied to the input representation (the sum of the embedding and the positional encoding) to obtain a query vector (Q), a key vector (K) and a value vector (V) for each word.
    • Query vectors express what the current word is looking for in the input sequence.
    • Key vectors describe what each word has to offer; they are compared with the queries to measure how relevant one word is to another.
    • Value vectors carry the information of each word that is passed on once the relevance has been determined.

Given an input sequence, \[ \begin{align*} \text{X} & = [x_1, x_2, x_3, \ldots, x_n] \\ \end{align*} \]

The linear transformations are expressed as:

\[ \begin{align*} \text{Q} & = \text{X} \cdot \text{W}^Q \\ \text{K} & = \text{X} \cdot \text{W}^K \\ \text{V} & = \text{X} \cdot \text{W}^V \\ \end{align*} \]

where, \(\text{W}^Q\), \(\text{W}^K\) and \(\text{W}^V\) are the weight matrices for the query, key and value vectors respectively. Q, K and V are the query, key and value matrices respectively.

  • The same linear transformation is applied to all words in the input sequence.
  • Through the linear transformation, the input word embeddings are mapped to three different representations: Query, Key and Value.
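
The three projections can be sketched with plain Python lists. The weight matrices below are hypothetical hand-picked values that make the result easy to check. In a real model they are learned during training.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Toy input: 3 words, each a 2-dimensional vector (embedding + positional encoding).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical weight matrices chosen for illustration only.
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity: queries equal the input
W_K = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two dimensions
W_V = [[2.0, 0.0], [0.0, 2.0]]  # doubles every value

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)
print(Q, K, V, sep="\n")
```

Each row of `Q`, `K` and `V` belongs to one word, and all words are transformed by the same three matrices.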

  2. Scaled Dot-Product Attention: After the linear transformation, the model computes attention scores by taking the dot product of each query vector with each key vector, scaling the result, and applying a softmax function to get the attention weights.
    1. Dot-product

Given the set of query vectors,

\[ \begin{align*} \text{Q} & = [q_1, q_2, q_3, \ldots, q_n] \\ \end{align*} \]

Given the set of key vectors,

\[ \begin{align*} \text{K} & = [k_1, k_2, k_3, \ldots, k_n] \\ \end{align*} \]

The attention score matrix \(\text{A}\) is a matrix where each entry \(A_{ij}\) is the dot product of the i-th query and the j-th key.

\[ \begin{align*} A_{ij} & = q_i \cdot k_j \\ \end{align*} \]
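
As a small illustration of the score matrix, the sketch below computes every query-key dot product for two toy queries and two toy keys. The numbers are made up.

```python
def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))

Q = [[1.0, 0.0], [0.0, 1.0]]  # q_1, q_2
K = [[1.0, 2.0], [3.0, 4.0]]  # k_1, k_2

# A[i][j] is the dot product of the i-th query and the j-th key.
A = [[dot(q, k) for k in K] for q in Q]
print(A)  # [[1.0, 3.0], [2.0, 4.0]]
```

Row i of `A` therefore holds the scores of word i against every word in the sequence.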

    2. Scaling
  • The dot products from above can become very large, especially when the key vectors have many dimensions.
  • Large values cause problems in the softmax: they push it into regions where its gradients are extremely small, which slows down training, and very large values can even exceed the range the computer can represent numerically.
  • So, the dot-products are scaled by dividing them by the square root of the dimension of the key vectors (\(d_{k}\)).

\[ \begin{align*} A_{ij} & = \frac{q_i \cdot k_j}{\sqrt{d_{k}}} \\ \end{align*} \]
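
A short worked derivation may help explain why \(\sqrt{d_{k}}\) is a sensible scale. It is a sketch under the assumption (not stated in the slides) that the components of \(q_i\) and \(k_j\) are independent random variables with mean 0 and variance 1:

\[ \begin{align*} \text{Var}(q_i \cdot k_j) & = \text{Var}\left(\sum_{l=1}^{d_{k}} q_{il} \, k_{jl}\right) = \sum_{l=1}^{d_{k}} \text{Var}(q_{il} \, k_{jl}) = d_{k} \\ \text{Var}\left(\frac{q_i \cdot k_j}{\sqrt{d_{k}}}\right) & = \frac{d_{k}}{d_{k}} = 1 \\ \end{align*} \]

Under this assumption, the unscaled scores spread out more and more as \(d_{k}\) grows, while the scaled scores keep a variance of 1 regardless of the dimension.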

    3. Softmax
  • The scaled dot-products are passed through a softmax function to get the attention weights.

\[ \begin{align*} \alpha_{ij} & = \frac{\exp(A_{ij})}{\sum_{l=1}^{n} \exp(A_{il})} \end{align*} \]

where,

  • \(\alpha_{ij}\) is the attention weight for the i-th query and j-th key.
  • \(\exp\) is the exponential function (\(e \approx 2.71828\)).
  • The softmax function is applied to each row of the attention score matrix \(\text{A}\).
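
The scaling and softmax steps can be sketched as follows. The scores are toy values, and subtracting the row maximum before exponentiating is a common numerical trick added here, not something the slides prescribe.

```python
import math

def softmax(row):
    """Turn a list of scores into positive weights that sum to 1."""
    # Subtracting the row maximum does not change the result,
    # but keeps exp() from overflowing on large scores.
    peak = max(row)
    exps = [math.exp(a - peak) for a in row]
    total = sum(exps)
    return [e / total for e in exps]

A = [[1.0, 3.0], [2.0, 4.0]]  # toy unscaled attention scores
d_k = 2                       # dimension of the key vectors in this toy example

scaled = [[a / math.sqrt(d_k) for a in row] for row in A]
alpha = [softmax(row) for row in scaled]  # softmax applied row by row

for row in alpha:
    print(row, sum(row))
```

Each row of `alpha` sums to 1, and the larger score in a row receives the larger weight.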

  3. Weighted Sum
  • Each value vector is multiplied by its attention weight, and the results are summed.
  • This weighted sum is the output of the self-attention mechanism for the i-th word.

\[ \begin{align*} o_i & = \sum_{j=1}^{n} \alpha_{ij} \cdot v_j \end{align*} \]

where,

  • \(o_i\) is the output of the self-attention mechanism for the i-th word.
  • \(\alpha_{ij}\) is the attention weight for the i-th query and j-th key.
  • \(v_j\) is the value vector of the j-th word.
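
Putting the steps together, single-head self-attention can be sketched in plain Python. This is a minimal illustration under toy assumptions (tiny dimensions, hand-picked weights), not an efficient or complete implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Return the output vectors and the attention weights."""
    # Step 1: linear transformations.
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    # Step 2: scaled dot-product scores, then a row-wise softmax.
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[a / math.sqrt(d_k) for a in row] for row in matmul(Q, K_T)]
    alpha = [softmax(row) for row in scores]
    # Step 3: weighted sum of the value vectors.
    return matmul(alpha, V), alpha

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy word vectors
identity = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical weights
output, alpha = self_attention(X, identity, identity, identity)
for row in output:
    print(row)
```

The output has one vector per input word, and each of them is a mixture of all value vectors weighted by attention.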

Multi-headed attention

  • In multi-head attention, the self-attention mechanism is applied several times in parallel to the input sequence.
  • A “head” is one of these parallel self-attention components; each head has its own weight matrices and independently learns different attention patterns.
  • This allows the model to focus on multiple parts of the input sequence at the same time and so capture more of the context.
  • Because the heads do not depend on each other, they can be computed in parallel, which keeps the computation efficient.

The outputs from all the attention heads are concatenated and passed through a linear layer:

\[ \begin{align*} \text{O} & = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) \cdot \text{W}^O \\ \end{align*} \]

where, \(\text{W}^O\) is the weight matrix for the output of the multi-head attention mechanism.

  • The outputs are then passed to a feed-forward neural network.
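
The concatenation formula for multi-head attention can be sketched as follows. The sizes (`d_model = 8`, `h = 2`) and the random weights are assumptions chosen to keep the example small, and the helper names are made up for this sketch.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(a - peak) for a in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_Q, W_K, W_V):
    """One self-attention head: projections, scaled scores, weighted sum."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    K_T = [list(col) for col in zip(*K)]
    scale = math.sqrt(len(K[0]))
    alpha = [softmax([a / scale for a in row]) for row in matmul(Q, K_T)]
    return matmul(alpha, V)

def multi_head_attention(X, heads, W_O):
    """Run every head, concatenate the results per word, then apply W_O."""
    head_outputs = [attention_head(X, *weights) for weights in heads]
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, h = 8, 2       # toy sizes (assumption)
d_k = d_model // h      # each head works in a smaller subspace
X = random_matrix(3, d_model)  # 3 words
heads = [tuple(random_matrix(d_model, d_k) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model)

O = multi_head_attention(X, heads, W_O)
print(len(O), len(O[0]))  # 3 8
```

Each head has its own three weight matrices, so the heads can learn different attention patterns, and the final linear layer `W_O` mixes their outputs back into one vector per word.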