Week 04
Intro to Transformers

LLMs in Linguistic Research WiSe 2024/25

Akhilesh Kakolu Ramarao

HHU

30 Oct 2024

What we will cover today:

  • Converting words to vectors
  • Transformers
    • Transformer architecture
    • Embedding
    • Positional Encoding
    • Self-Attention
    • Multi-headed attention

Converting words to vectors

  • Machines cannot understand words directly; they can only work with numbers.

  • Converting words to vectors means mapping each word to a numerical vector that a model can process.

  • Can you think of ways to convert words to vectors?

  • How would you represent the sentence “I am a student” as a vector?

One-hot encoding

  • One-hot encoding is the simplest way to convert words to vectors.
  • Each word is represented as a vector of zeros with a 1 at the index corresponding to the word.
  • For example, the sentence “I am a student” can be represented as a matrix of one-hot encoded vectors.
  • “I” is represented as [1, 0, 0, 0], “am” is represented as [0, 1, 0, 0], and so on.
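
To make this concrete, here is a minimal Python/NumPy sketch of one-hot encoding; the toy vocabulary and helper function are illustrative, not part of the slides:

```python
import numpy as np

# Toy vocabulary for the example sentence "I am a student".
vocab = ["I", "am", "a", "student"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)   # vector of zeros
    vec[word_to_index[word]] = 1            # 1 at the word's index
    return vec

sentence = "I am a student".split()
matrix = np.stack([one_hot(word) for word in sentence])
print(matrix)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]
```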

Limitations of one-hot encoding

  • One-hot encoding has several limitations:
    • It does not capture the relationships between words.
    • It does not consider the context in which the words appear.
    • It does not account for the similarity between words.

Word embeddings

  • Word embeddings are dense vector representations of words that capture the relationships between words.
  • Word embeddings are learned from large text corpora using neural networks.
  • For example, the word “king” might be represented as [0.2, 0.3, 0.5], and the word “queen” might be represented as [0.1, 0.4, 0.6]. These vectors capture the relationship between the two words.
  • These vectors are learned in such a way that similar words have similar embeddings.
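
A small NumPy sketch of this idea, reusing the illustrative “king” and “queen” vectors above and adding a hypothetical unrelated word; real embeddings are learned from corpora and have hundreds of dimensions:

```python
import numpy as np

# Illustrative vectors from the slide; "apple" is a hypothetical unrelated word.
king = np.array([0.2, 0.3, 0.5])
queen = np.array([0.1, 0.4, 0.6])
apple = np.array([0.9, 0.1, 0.0])

def cosine_similarity(u, v):
    # Similar words should point in similar directions (high cosine similarity).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(king, queen))  # ~0.98: similar words, similar vectors
print(cosine_similarity(king, apple))  # ~0.38: dissimilar words
```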

Transformers

  • A transformer is a type of deep learning model that has been widely used in natural language processing tasks.
  • The transformer architecture was introduced in the paper Attention is All You Need by Vaswani et al. (2017).
  • Unlike older recurrent models, transformers process entire sentences in parallel rather than word by word, which makes training faster and helps the model capture long-range dependencies.
  • They form the foundation of many powerful language models such as GPT-3, BERT, and T5.
  • They have been used in a wide range of applications such as machine translation, text summarization, and question answering.

Transformer architecture

Image from “Attention is all you need”

  • In a transformer-based encoder-decoder architecture, the transformer consists of an encoder block and a decoder block.
  • The encoder block consists of a stack of N = 6 identical layers.
  • The decoder block consists of a stack of N = 6 identical layers.
  • The input sequence is passed through the encoder block to generate a sequence of hidden states, which are then passed through the decoder block to generate the output sequence.
  • For example, the input sequence might be a sentence in English and the output sequence its German translation: “I am a student” -> “Ich bin ein Student”.
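
As a rough sketch, the same configuration can be instantiated with PyTorch's built-in nn.Transformer; the toy tensors below stand in for already-embedded English and German sentences:

```python
import torch
import torch.nn as nn

# Encoder-decoder transformer with the dimensions from the original paper:
# d_model = 512, N = 6 encoder layers, N = 6 decoder layers, 8 attention heads.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
)

# Toy inputs: a source sequence of 4 tokens ("I am a student") and a target
# sequence of 4 tokens ("Ich bin ein Student"), already embedded as
# 512-dimensional vectors; shape is (sequence length, batch size, d_model).
src = torch.rand(4, 1, 512)
tgt = torch.rand(4, 1, 512)

out = model(src, tgt)
print(out.shape)  # torch.Size([4, 1, 512]): one hidden state per target position
```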

Embedding

  • Each word in the input and output sequences is represented as a 512-dimensional vector called an embedding.
  • The embedding layer maps each word to its corresponding embedding vector.
  • Before training begins, each word is assigned a random embedding vector.
  • These are small random values obtained from a normal distribution.
  • The model captures patterns and dependencies between words by continuously updating these embeddings during training.
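
A minimal PyTorch sketch of such an embedding layer, assuming a toy four-word vocabulary; nn.Embedding draws its initial weights from a normal distribution and updates them during training:

```python
import torch
import torch.nn as nn

# Toy vocabulary mapping each word to an integer id.
vocab = {"I": 0, "am": 1, "a": 2, "student": 3}

# Embedding layer: each of the 4 words gets a 512-dimensional vector,
# initialised with random values drawn from a normal distribution.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

token_ids = torch.tensor([vocab[w] for w in "I am a student".split()])
embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([4, 512])
```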

Positional Encoding

  • Positional encoding is added to the input embeddings to give the model information about the position of each word in the sequence.
  • Because the transformer processes all words simultaneously rather than reading left to right as humans do, it needs an explicit signal that “word 1 comes before word 2.”
  • The positional encoding in the original transformer is implemented using sine and cosine functions of different frequencies.
  • Using sine waves makes it easier for the transformer to understand both nearby and far-apart relationships between words. (It’s similar to how music uses different frequencies to create unique sounds.)
  • The positional encoding vectors have the same dimensions as the embedding vectors and are added element-wise to create the input representation for each word.
  • This allows the model to differentiate between words based on their position in the sequence.
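
A NumPy sketch of the sinusoidal encoding described in the original paper (sine on even dimensions, cosine on odd ones); the sequence length is a toy value:

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    # Sinusoidal positional encoding: sine on even dims, cosine on odd dims.
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    # Each pair of dimensions gets its own frequency.
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # cosine on odd indices
    return pe

pe = positional_encoding(seq_len=4)   # one row per word in "I am a student"
# The input representation of each word is its embedding plus this row, element-wise.
```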

  • Before diving into multi-head attention, it is important to understand the self-attention mechanism, which forms its foundation.
  • In the transformer, the self-attention mechanism is applied to each word in the sequence.

Self-Attention

  • Self-attention is a mechanism that lets the model weigh the importance of the different words in the input sequence when computing the representation of each word.
  • In other words, the model can focus on different parts of the input sequence as it processes each position.
  • The self-attention mechanism is applied to each word in the input sequence.

The self-attention mechanism has the following steps:

  1. Linear Transformation:
    • A linear transformation is applied to the input representation (obtained from the Embedding and Positional Encoding) to get a query vector (q), a key vector (k) and a value vector (v) for each word.
    • Query vectors express what the model is currently looking for in the input sequence.
    • Key vectors provide information about inter-word dependencies and connections between words.
    • Value vectors carry the actual information of each word that is combined into the output.

Given an input sequence, X = [x_1, x_2, x_3, …, x_n]

The linear transformations are expressed as:

Q = X · W_Q
K = X · W_K
V = X · W_V

where W_Q, W_K and W_V are the weight matrices for the query, key and value vectors respectively, and Q, K and V are the resulting query, key and value matrices.

  • The same linear transformation is applied to all words in the input sequence.
  • Through the linear transformation, the input word embeddings are mapped to three different contexts: Query, Key and Value.
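
A minimal NumPy sketch of these projections, assuming a toy sequence of four words and small toy dimensions instead of the original 512:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8   # toy sizes; the original model uses d_model = 512

# Input representations x_1 ... x_n (embedding + positional encoding), stacked as rows.
X = rng.normal(size=(n, d_model))

# Learned weight matrices (random here, since this is only an illustration).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # one query vector per word
K = X @ W_K   # one key vector per word
V = X @ W_V   # one value vector per word
```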

  2. Scaled Dot-Product Attention: After the linear transformation, the model computes attention scores by taking the dot product of each query vector with each key vector, scaling the results, and applying a softmax function to obtain the attention weights.
  2.1. Dot-product

Given the set of query vectors,

Q = [q_1, q_2, q_3, …, q_n]

Given the set of key vectors,

K = [k_1, k_2, k_3, …, k_n]

The attention score matrix A is a matrix where each entry A_ij is the dot product of the i-th query and the j-th key.

A_ij = q_i · k_j

  2.2. Scaling
  • The dot products from above can potentially become very large.
  • Large values affect training by causing issues in the softmax (they push it into saturation and can exceed the representable range of the computer's numerical precision, leading to incorrect outputs).
  • So the dot products are scaled by the square root of the dimension of the key vectors (d_k).

A_ij = (q_i · k_j) / √d_k
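
A tiny numeric sketch of the dot-product and scaling steps, using hypothetical 4-dimensional query and key vectors for a two-word sequence:

```python
import numpy as np

# Hypothetical query and key vectors for two words (d_k = 4).
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0]])
K = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
d_k = K.shape[1]

A = Q @ K.T                   # A[i, j] = q_i · k_j
A_scaled = A / np.sqrt(d_k)   # divide by sqrt(d_k) to keep the scores moderate
print(A_scaled)
# [[0.5 0.5]
#  [1.  2. ]]
```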

  2.3. Softmax
  • The scaled dot-products are passed through a softmax function to get the attention weights.

α_ij = exp(A_ij) / Σ_{m=1}^{n} exp(A_im)

where,

  • α_ij is the attention weight for the i-th query and j-th key.
  • exp is the exponential function, exp(x) = e^x with e ≈ 2.71828.
  • The softmax function is applied to each row of the attention score matrix A.
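
A short sketch of the row-wise softmax, applied to the scaled scores from the previous example:

```python
import numpy as np

def softmax(rows):
    # Subtracting the row maximum keeps exp() numerically stable.
    shifted = rows - rows.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

A_scaled = np.array([[0.5, 0.5],
                     [1.0, 2.0]])   # scaled scores from the previous sketch
weights = softmax(A_scaled)
print(weights)           # each row of attention weights sums to 1
# [[0.5        0.5       ]
#  [0.26894142 0.73105858]]
```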

  3. Weighted Sum
  • For each word, the attention weights are used to form a weighted sum of the value vectors: each value vector is multiplied by its attention weight and the results are summed.
  • The weighted sum is the output of the self-attention mechanism.

O_i = Σ_{j=1}^{n} α_ij · v_j

where,

  • O_i is the output of the self-attention mechanism for the i-th word.
  • α_ij is the attention weight for the i-th query and j-th key.
  • v_j is the value vector for the j-th key.
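
Putting the steps together, here is a minimal NumPy sketch of the full self-attention computation over a toy sequence with random inputs, purely for illustration:

```python
import numpy as np

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                     # scaled dot products
    w = np.exp(A - A.max(axis=-1, keepdims=True))  # row-wise softmax ...
    alpha = w / w.sum(axis=-1, keepdims=True)      # ... gives attention weights
    return alpha @ V                               # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_k = 4, 8                                      # toy sequence of 4 words
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

O = self_attention(Q, K, V)
print(O.shape)   # (4, 8): one output vector per word in the sequence
```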

Multi-headed attention

  • The self-attention mechanism is applied multiple times in parallel to the input sequence.
  • “Head” refers to an individual component in the multi-head self-attention mechanism that independently learns different self-attention patterns.
  • This allows the model to focus on multiple parts of the input sequence in parallel, and thereby to capture the context more fully.
  • This also makes the model more computationally efficient as it enables parallel processing across different heads.

The outputs from all the attention heads are concatenated and passed through a linear layer:

O = Concat(head_1, head_2, …, head_h) · W_O

where W_O is the weight matrix applied to the concatenated output of the multi-head attention mechanism.

  • The outputs are then passed to a feed-forward neural network.
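
A simplified NumPy sketch of multi-head attention with toy dimensions (the original model uses d_model = 512 and h = 8 heads):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4          # toy values: 4 words, 4 heads
d_k = d_model // h                # each head works in a smaller subspace

def attention(Q, K, V):
    A = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(A - A.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

X = rng.normal(size=(n, d_model))  # input representations

# Each head has its own projections and learns its own attention pattern.
heads = []
for _ in range(h):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the head outputs and map them through W_O.
W_O = rng.normal(size=(d_model, d_model))
O = np.concatenate(heads, axis=-1) @ W_O
print(O.shape)   # (4, 16): passed on to the feed-forward network
```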
