LLMs in Linguistic Research WiSe 2024/25
28 Jan 2025
A simple trick: instead of updating all of the model's weights during finetuning, train only a small set of additional parameters.
Paper: The Power of Scale for Parameter-Efficient Prompt Tuning (Lester et al., 2021), https://arxiv.org/pdf/2104.08691; check Figure 1 for a performance overview.
In a prompt-based LLM, the output depends heavily on both the model weights and the prompt.
The idea of prompt tuning is to prepend a small set of trainable embedding vectors (a “soft prompt”) to the input and tune only those, keeping the model weights frozen.
In the paper above, the T5-XXL model has about 11 billion parameters, while the added soft prompt has only 20,480 (a 5-token prompt times the 4,096-dimensional embedding size). Yet prompt tuning performed comparably to full model finetuning on the SuperGLUE benchmark.
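A minimal sketch of prompt tuning with the Hugging Face PEFT library (the model name, token count, and random initialization below are illustrative choices, not the paper's exact setup):

# Prompt-tuning sketch with Hugging Face PEFT (illustrative setup).
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=20,   # length of the trainable soft prompt
)
model = get_peft_model(base, config)

# Only the soft-prompt embeddings train; all base weights stay frozen.
model.print_trainable_parameters()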
LoRA is a technique that reduces the number of trainable parameters by factoring the weight update into two low-rank matrices, a low-rank decomposition in the spirit of Singular Value Decomposition (SVD).
Dimensionality reduction:
The pretrained model parameters live in a very high-dimensional space. It has been shown that finetuning a small number of parameters in a compressed, low-dimensional subspace can still strongly steer the final output. LoRA exploits this by creating a new set of trainable parameters that is much smaller than the original weight matrices.
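To see why a low-rank factorization helps, compare the parameter count of a full d×d weight update with its rank-r factors (the numbers are illustrative):

# Parameter count: full weight update vs. a rank-r LoRA factorization.
# d and r are illustrative choices, not values from the paper.
d, r = 4096, 8

full_update = d * d             # updating W directly: 16,777,216 parameters
lora_update = d * r + r * d     # B (d x r) plus A (r x d): 65,536 parameters

print(full_update // lora_update)   # -> 256x fewer trainable parameters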
Let’s check Figure 1 from the LoRA paper (https://arxiv.org/abs/2106.09685), which shows the low-rank matrices A and B next to the frozen pretrained weights…
These LoRA weights can be swapped in and out for different tasks, and different sets of LoRA weights can be stored as separate task-specific adapters on top of a single base model.
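A rough sketch of attaching LoRA adapters with the same PEFT library; the rank, scaling, and target module names below are assumptions that vary by model architecture:

# Attach LoRA adapters to a frozen base model (sketch; hyperparameters
# and target_modules are assumptions that depend on the architecture).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                         # rank of the update matrices
    lora_alpha=16,               # scaling factor for the update
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    fan_in_fan_out=True,         # GPT-2 stores this layer as a transposed Conv1D
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# Adapters are small and can be saved and swapped per task:
model.save_pretrained("lora-task-a")   # stores only the A/B matrices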
Further reading:
Typically, the parameters of pretrained models are stored in a 32-bit format; QLoRA compresses them to a 4-bit format. This reduces the memory footprint and enables finetuning with far fewer resources. One float32 parameter takes 32 bits, i.e. 4 bytes, so a one-billion-parameter model stored in FP32 needs about 4 gigabytes for its weights alone.
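The memory arithmetic generalizes to other precisions; a quick back-of-the-envelope check (the parameter count is illustrative):

# Memory footprint for storing model weights alone (excludes activations,
# gradients, and optimizer state). 1B parameters is an illustrative size.
params = 1_000_000_000

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>8}: {gigabytes:.1f} GB")   # float32: 4.0 GB ... 4-bit: 0.5 GB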
Post-Training Quantization (PTQ): quantization applied to an LLM after it has been trained.
Quantization-Aware Training (QAT): quantization simulated during training, so the model learns to compensate for the reduced precision.
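As a toy illustration of PTQ, one can round trained float32 weights to a 4-bit grid and dequantize them; real schemes such as QLoRA's NF4 are considerably more sophisticated:

# Toy post-training quantization: round float32 weights to a signed
# 4-bit grid and dequantize. Real 4-bit schemes are more elaborate.
import torch

w = torch.randn(4, 4)              # "trained" float32 weights

scale = w.abs().max() / 7          # signed 4-bit integer range: -8..7
q = torch.clamp((w / scale).round(), -8, 7).to(torch.int8)

w_hat = q.float() * scale          # dequantized weights used at inference
print((w - w_hat).abs().max())     # the quantization error stays small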
Distillation is the process of using a model's generated outputs as training data for finetuning.
Knowledge Distillation: take a big “teacher” model and train a smaller “student” model to mimic its outputs (a common loss is sketched below, after the next item).
Context distillation: prompting the model with a prompt prefix (“analyze like a linguist”) and then training the model on its own responses, so the elicited behaviour persists without the prefix (see the second sketch below).
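A common way to implement knowledge distillation is to train the student on the teacher's softened output distribution; a minimal sketch of that loss (the temperature value is an assumed hyperparameter):

# Knowledge-distillation loss: the student mimics the teacher's
# softened logits. Temperature T is a typical but assumed value.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened distributions; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student_logits = torch.randn(8, 100)   # fake logits: batch of 8, vocab of 100
teacher_logits = torch.randn(8, 100)
print(distillation_loss(student_logits, teacher_logits))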
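And a rough sketch of the context-distillation loop; generate and finetune are placeholder helpers here, not a real API:

# Context-distillation sketch. `generate` and `finetune` are
# placeholder helpers, not calls from a real library.
PREFIX = "Analyze like a linguist. "

def build_distillation_data(model, prompts):
    pairs = []
    for prompt in prompts:
        # 1) Elicit the desired behaviour with the prefix.
        response = model.generate(PREFIX + prompt)
        # 2) Pair the *unprefixed* prompt with the response, so finetuning
        #    bakes the behaviour in without needing the prefix at inference.
        pairs.append((prompt, response))
    return pairs

# finetune(model, build_distillation_data(model, prompts))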
Role prompting: here, the model is prompted to take on a role, such as “act as an expert” or “be a linguist”. This cannot be categorized as finetuning because no weights are updated.
As this method doesn’t require any additional compute at train time, it is easy to iterate on. But it does not shape the LLM’s behaviour as strongly as finetuning does.
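For example, role prompting in a chat format might look like this (a sketch of the message structure only; the wording is illustrative):

# Role prompting via a system message; no weights change anywhere.
messages = [
    {"role": "system", "content": "You are an expert linguist."},
    {"role": "user", "content": "Gloss this sentence: 'Der Hund schläft.'"},
]
# Passing `messages` to any chat LLM steers behaviour purely at inference
# time: iterating just means editing the strings above, not retraining.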
Learn more: https://huggingface.co/docs/peft/conceptual_guides/prompting