## Finetuning LLMs for machine translation

We will finetune the Text-to-Text Transfer Transformer (T5) on a machine translation task.

## T5 ([Raffel et al.](https://arxiv.org/pdf/1910.10683))

T5 uses an encoder-decoder architecture that closely resembles the original transformer.

T5's unique feature is its unified text-to-text approach that reformulates all NLP tasks into a consistent format:

- Every task is framed as a text-to-text transformation problem
- Output is always generated as text, even for classification tasks
- Input includes task-specific prefixes

For example,

- Translation: "translate English to German: Hello!" → "Hallo!"
- Sentiment analysis: "sst2 sentence: I had a great time!" → "positive"
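
The prefixing idea can be sketched in a few lines of plain Python. The `TASK_PREFIXES` table and the `to_text_to_text` helper are hypothetical names used only for illustration; the prefix strings themselves follow the ones used in the T5 paper.

```python
# Hypothetical lookup table: task name -> T5-style text prefix
TASK_PREFIXES = {
    "translation_en_de": "translate English to German: ",
    "sentiment": "sst2 sentence: ",
    "summarization": "summarize: ",
}


def to_text_to_text(task, text):
    """Turn a raw input into the single string the model would receive."""
    return TASK_PREFIXES[task] + text


print(to_text_to_text("translation_en_de", "Hello!"))
print(to_text_to_text("sentiment", "I had a great time!"))
```

Because every task becomes "string in, string out", the same model, loss and decoding procedure can be reused across tasks.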

### Pre-training

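The model is pre-trained on the [Colossal Clean Crawled Corpus (C4)](https://www.tensorflow.org/datasets/catalog/c4), a cleaned version of the Common Crawl web scrape. This has several advantages:

- About two orders of magnitude larger than Wikipedia (roughly 750 GB of text)
- Cleaned through deduplication and heuristic filtering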

### How is it different from the original transformer?

- T5 reformulates all NLP tasks into a text-to-text format, unlike the original transformer which was primarily designed for machine translation
- Uses task-specific prefixes (e.g., "translate:", "summarize:")

#### Key differences:

- Applies a simplified layer normalization (rescaling only, no additive bias) before the attention and feed-forward layers instead of after
- Keeps a residual connection around each sub-layer to maintain gradient flow
- Dropout is applied throughout the network (e.g., attention weights, feed-forward network, skip connections). Remember: dropout is a regularization technique used to prevent overfitting. It randomly deactivates neurons during training with a specified probability
- Uses relative position embeddings (a learned scalar bias added to the attention logits) instead of the original transformer's fixed sine and cosine position encodings
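
The dropout reminder above can be made concrete with a minimal sketch of "inverted" dropout in plain Python. This is an illustration under simple assumptions (a flat list of activations, a hypothetical `dropout` helper), not the implementation used inside T5 or PyTorch.

```python
import random


def dropout(values, p, rng, training=True):
    """Zero each value with probability p and rescale the survivors.

    Rescaling by 1 / (1 - p) keeps the expected activation unchanged,
    so nothing needs to be adjusted at inference time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)                    # fixed seed so the run is repeatable
activations = [0.5, -1.2, 3.0, 0.7]
print(dropout(activations, p=0.1, rng=rng))                   # training: some values may be zeroed
print(dropout(activations, p=0.1, rng=rng, training=False))   # inference: unchanged
```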

### Pre-LN Architecture

- Places layer normalization inside the residual branch, before each sub-layer, so the skip path itself is left untouched
- Can reduce or remove the need for learning-rate warm-up
- Tends to result in better training stability
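
The difference between normalizing before and after a sub-layer can be sketched in plain Python. The helpers below (`rms_norm`, `pre_ln_block`, `post_ln_block`, `zero_sublayer`) are hypothetical names for illustration. The same scale-only normalization, without a learned gain, is used for both variants so that only the placement differs.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale-only normalization: divide by the root mean square (no mean subtraction, no bias)."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sub-layer input; the skip path carries x unchanged."""
    return [xi + si for xi, si in zip(x, sublayer(rms_norm(x)))]


def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition, so the skip path is rescaled too."""
    return rms_norm([xi + si for xi, si in zip(x, sublayer(x))])


def zero_sublayer(h):
    """Stand-in for attention or feed-forward that contributes nothing."""
    return [0.0] * len(h)


x = [3.0, 4.0]
print(pre_ln_block(x, zero_sublayer))   # the input passes through the skip path as-is
print(post_ln_block(x, zero_sublayer))  # the input is rescaled even though the sub-layer did nothing
```

With Pre-LN the residual stream is an identity path from input to output, which is one common explanation for its more stable gradients.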


In [None]:
# transformers: Hugging Face's library for state-of-the-art NLP models
# datasets: Library for easily accessing and sharing datasets
# evaluate: Framework for evaluating machine learning models
# sacrebleu: Library for evaluating machine translation quality using BLEU score

!pip install transformers datasets evaluate sacrebleu

In [None]:
# Import necessary libraries and modules

# Load datasets from Hugging Face's datasets library
from datasets import load_dataset

# Import essential components from transformers library:
from transformers import (
    AutoTokenizer,          # For automatic tokenizer loading based on model name
    DataCollatorForSeq2Seq, # Handles batching and padding for sequence-to-sequence tasks
    AutoModelForSeq2SeqLM,  # For automatic loading of sequence-to-sequence models
    Seq2SeqTrainingArguments, # Contains training configuration
    Seq2SeqTrainer,         # Handles the training loop for sequence-to-sequence models
    pipeline               # Provides easy-to-use interfaces for various NLP tasks
)

# Import evaluation tools
from evaluate import evaluator
import evaluate

# Import numerical computing library
import numpy as np

# Import PyTorch for deep learning operations
import torch

# Import DatasetDict for managing train/validation/test splits
from datasets import DatasetDict

### Loading dataset


In [None]:
# Load the TED Talks dataset from IWSLT 2013 conference
# Parameters:
# - "ted_iwlst2013": Dataset name (TED talks from IWSLT 2013)
# - "de-en": Language pair (German-English; the same sentence pairs are used here for English to German)
# - trust_remote_code=True: Allows execution of remote code for dataset loading
dataset = load_dataset("ted_iwlst2013", "de-en", trust_remote_code=True)

In [None]:
print(dataset)

```
DatasetDict({
    # Training split containing:
    train: Dataset({
        # Features in the dataset:
        features: [
            'id',           # Unique identifier for each example
            'translation'   # Contains the parallel text pairs
        ],
        num_rows: 143836   # Total number of training examples
    })
})
```

In [None]:
# Load a pre-trained tokenizer for the T5-small model
tokenizer = AutoTokenizer.from_pretrained("t5-small")

In [None]:

# Split the data into train, validation and test sets

# Take the first 50,000 examples and hold out 10% of them (5,000 examples)

dataset_train_valid_test_split = dataset["train"].select(range(50000)).train_test_split(test_size=0.1)

# Split the 5,000 held-out examples evenly into validation and test sets
# (select(range(5000)) keeps all of them, since the held-out split has exactly 5,000 rows)
dataset_test_valid = dataset_train_valid_test_split["test"].select(range(5000)).train_test_split(test_size=0.5)

data_dict = DatasetDict({
    'train': dataset_train_valid_test_split['train'],  # Training data
    'test': dataset_test_valid['test'],                # Test data
    'valid': dataset_test_valid['train']               # Validation data
})
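
The resulting split sizes can be checked with a small stand-alone sketch that mimics the two-stage split on plain index lists. The `split` helper is hypothetical and only approximates what `train_test_split` does (shuffle, then cut); it does not reproduce the library's exact shuffling.

```python
import random


def split(items, test_fraction, seed):
    """Shuffle a copy of items and cut off test_fraction of it as the held-out part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train part, test part)


# Stage 1: 50,000 examples -> 90% train, 10% held out
train_idx, heldout_idx = split(range(50_000), test_fraction=0.1, seed=0)

# Stage 2: held-out examples -> half validation, half test
valid_idx, test_idx = split(heldout_idx, test_fraction=0.5, seed=0)

print(len(train_idx), len(valid_idx), len(test_idx))
```

Passing a fixed seed, as in this sketch, keeps the split identical across runs, which makes before/after comparisons easier to reproduce.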

In [None]:
# Define source and target languages
src_lang = "en"                # Source language: English
tgt_lang = 'de'                # Target language: German

# Create prefix required by T5 model to identify the translation task
prefix = "translate English to German: "

# Define function to preprocess and tokenize the dataset
def preprocess(dataset):
    # Create input texts by adding prefix to English sentences
    inputs = [prefix + data[src_lang] for data in dataset["translation"]]

    # Extract German target sentences
    targets = [data[tgt_lang] for data in dataset["translation"]]

    # Tokenize both inputs and targets
    model_inputs = tokenizer(
        inputs,                # Source sentences with prefix
        text_target=targets,   # Target sentences
        max_length=128,        # Maximum sequence length
        truncation=True        # Truncate sequences longer than max_length
    )
    return model_inputs

# Apply preprocessing to all splits in the dataset
data_dict_tokenized = data_dict.map(preprocess, batched=True)
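
With `batched=True`, the mapped function receives a dictionary of lists and returns a dictionary of lists. The sketch below shows that contract without downloading anything. It assumes a hypothetical whitespace "tokenizer" (`toy_tokenize`) in place of the real T5 tokenizer, so the outputs are words rather than token IDs.

```python
def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a tokenizer: split on whitespace and truncate."""
    return text.split()[:max_length]


def preprocess_batch(batch, prefix, src_lang, tgt_lang, max_length):
    """Mimic the batched preprocessing: prefix the sources, then tokenize both sides."""
    inputs = [prefix + pair[src_lang] for pair in batch["translation"]]
    targets = [pair[tgt_lang] for pair in batch["translation"]]
    return {
        "input_ids": [toy_tokenize(text, max_length) for text in inputs],
        "labels": [toy_tokenize(text, max_length) for text in targets],
    }


toy_batch = {
    "translation": [
        {"en": "Hello world", "de": "Hallo Welt"},
        {"en": "Thank you", "de": "Danke"},
    ]
}

result = preprocess_batch(
    toy_batch, "translate English to German: ", "en", "de", max_length=5
)
print(result["input_ids"][0])  # truncated to 5 items, so "world" is cut off
print(result["labels"][0])
```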

In [None]:
# Load the SacreBLEU evaluation metric
# SacreBLEU is a standardized BLEU score implementation for machine translation evaluation
bleu = evaluate.load("sacrebleu")

# BLEU (Bilingual Evaluation Understudy) Score Explanation
#
# Definition:
# - Metric for evaluating machine translation quality
# - Scores range from 0 to 100 (higher is better)
# - Compares machine translation with human reference(s)
#
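# How BLEU Works:
# 1. N-gram matching (clipped precision against the reference):
#    - Unigrams: Individual words
#    - Bigrams: Pairs of consecutive words
#    - Trigrams: Three consecutive words
#    - 4-grams: Four consecutive words
# 2. The four n-gram precisions are combined with a geometric mean
# 3. A brevity penalty lowers the score of translations shorter than the reference
#
# Interpretation (rough rule of thumb; scores depend on language pair and test set):
# - 0-15: Poor translation
# - 15-30: Understandable but with errors
# - 30-50: Good translation
# - 50+: High-quality translation
# - 100: Perfect match (extremely rare)
#
# Limitations:
# - Measures surface n-gram overlap, so it doesn't capture meaning preservation
# - Sensitive to word order
# - May not reflect human judgment perfectly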
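
The n-gram matching described above can be sketched as a simplified, single-sentence BLEU in plain Python. This is not the SacreBLEU implementation: it assumes whitespace tokenization, a single reference, no smoothing, and the standard brevity penalty (a factor below 1 when the candidate is shorter than the reference). The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of the 1..max_n precisions times the brevity penalty, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, a single zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("the cat sat on the red mat", reference))
print(simple_bleu("a dog", reference))
```

The clipping step is what stops a candidate such as "the the the" from getting full unigram credit for repeating a common word.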

In [None]:
# Function to clean predictions and labels for evaluation
def clean_texts(preds, labels):
    # Remove leading/trailing whitespace from predictions
    preds = [pred.strip() for pred in preds]

    # Remove whitespace from labels and wrap each in a list
    # Labels are wrapped in lists because BLEU expects multiple references
    labels = [[label.strip()] for label in labels]

    return preds, labels

In [None]:
# Function to compute BLEU score for model evaluation
def compute_metrics(pred_labels):
    # Unpack predictions and labels
    preds, labels = pred_labels

    # Handle case where predictions are returned as tuple (happens during training)
    if isinstance(preds, tuple):
        preds = preds[0]

    # Decode predictions from token IDs back to text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # -100 marks label positions that the loss ignores (padding); swap it back to
    # the pad token ID so the tokenizer can decode the labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode labels from token IDs back to text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Clean and format predictions and labels
    decoded_preds, decoded_labels = clean_texts(decoded_preds, decoded_labels)

    # Calculate BLEU score
    result = bleu.compute(predictions=decoded_preds, references=decoded_labels)
    # Extract just the score from the result
    result = {"bleu": result["score"]}

    return result
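
The label handling inside `compute_metrics` can be illustrated on toy data. The sketch below assumes a hypothetical four-entry vocabulary and a `toy_decode` helper in place of the real tokenizer; it only shows why -100 has to be replaced before decoding and how references end up wrapped in lists.

```python
IGNORE_INDEX = -100   # value used for label positions that the loss skips
PAD_TOKEN_ID = 0      # assumed pad ID for this toy vocabulary

# Hypothetical vocabulary: token ID -> text
TOY_VOCAB = {0: "<pad>", 5: "Guten", 6: "Tag", 7: "Hallo"}


def restore_padding(label_rows, pad_token_id):
    """Replace every -100 with the pad ID so each ID can be looked up in the vocabulary."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in row] for row in label_rows]


def toy_decode(row, pad_token_id):
    """Decode IDs to text, skipping padding (similar in spirit to skip_special_tokens=True)."""
    return " ".join(TOY_VOCAB[tok] for tok in row if tok != pad_token_id)


labels = [[5, 6, -100, -100], [7, -100, -100, -100]]
restored = restore_padding(labels, PAD_TOKEN_ID)
decoded = [toy_decode(row, PAD_TOKEN_ID) for row in restored]

# One list of references per prediction, because BLEU allows several references
references = [[text.strip()] for text in decoded]
print(references)
```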

In [None]:
# Load the T5-small model and move it to GPU (device 0)
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # Load pre-trained T5 model
t5.to(0)  # Move model to first GPU (device ID 0)


In [None]:
# Create a data collator for sequence-to-sequence tasks
# DataCollatorForSeq2Seq handles batch preparation and dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,  # The tokenizer used to process text
    model=t5             # The T5 model being used
)


# Dynamic padding:
# - Pads sequences within each batch to same length
# - Uses maximum length in current batch only
# - More memory efficient than global padding
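
Dynamic padding can be sketched in plain Python as follows. This is an illustration of the idea, not the `DataCollatorForSeq2Seq` source: the `pad_batch` helper is hypothetical, it pads on the right, and it assumes a pad token ID of 0 and a label padding value of -100.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad every example only up to the longest sequence in this batch."""
    max_input_len = max(len(ex["input_ids"]) for ex in examples)
    max_label_len = max(len(ex["labels"]) for ex in examples)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_input_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so padded positions do not contribute to the loss
        n_label_pad = max_label_len - len(ex["labels"])
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_label_pad)
    return batch


examples = [
    {"input_ids": [11, 12, 13], "labels": [21, 22]},
    {"input_ids": [14], "labels": [23]},
]
padded = pad_batch(examples)
print(padded)
```

A batch of short sentences therefore stays short, instead of being padded to the global `max_length` of 128.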

In [None]:
# Configure training arguments for fine-tuning the T5 model
training_args = Seq2SeqTrainingArguments(
    # Directory to save model checkpoints and logs
    output_dir="/content/drive/My Drive/machine_translation_t5",

    # Evaluate model after each epoch
    # (recent transformers releases rename this argument to eval_strategy)
    evaluation_strategy="epoch",

    # Learning rate for optimization
    learning_rate=2e-5,

    # Batch sizes for training and evaluation
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,

    # Weight decay coefficient (a penalty that encourages smaller weights; the default
    # AdamW optimizer applies it as decoupled weight decay rather than a plain L2 loss term)
    weight_decay=0.01,

    # Keep only the last 3 checkpoints
    save_total_limit=3,

    # Number of training epochs
    num_train_epochs=2,

    # Enable text generation during evaluation
    predict_with_generate=True,

    # Enable mixed precision training (faster training, less memory)
    fp16=True,

    # Disable external reporting
    report_to="none"
)
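
As a quick sanity check on this configuration, the number of optimizer steps can be estimated by hand. The sketch assumes a single device, no gradient accumulation, and the 45,000 training examples produced by the split above; `optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def optimizer_steps(n_examples, batch_size, epochs):
    """Batches per epoch (the last one may be partial) times the number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


steps_per_epoch = math.ceil(45_000 / 16)
total_steps = optimizer_steps(n_examples=45_000, batch_size=16, epochs=2)
print(steps_per_epoch, total_steps)
```

Knowing the total helps when reading the progress bar and when deciding how often to log or save checkpoints.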

In [None]:
# Initialize the Sequence-to-Sequence Trainer
trainer = Seq2SeqTrainer(
    model=t5,                                    # Pre-trained T5 model
    args=training_args,                          # Training configuration
    train_dataset=data_dict_tokenized["train"],  # Training data
    eval_dataset=data_dict_tokenized["valid"],   # Validation data
    tokenizer=tokenizer,                         # Tokenizer for text processing
    data_collator=data_collator,                 # Handles batch preparation
    compute_metrics=compute_metrics,             # Evaluation metric function
)

In [None]:
# Evaluate the pre-trained model on test set before fine-tuning
prefinetuned_results = trainer.evaluate(
    eval_dataset=data_dict_tokenized["test"]  # Use test split for evaluation
)


In [None]:
# Display the BLEU score achieved on test set before fine-tuning
print("Test set bleu score before finetuning: ", prefinetuned_results["eval_bleu"])

In [None]:
# Test the model with a conversational example sentence
test_text = "translate English to German: I am a student of English language and Linguistics. I really like working with LLMs."

# Convert input text to token IDs and move to GPU
inputs_ids = tokenizer(test_text, return_tensors="pt").input_ids.to(0)

# Generate translation with specified parameters
outputs = t5.generate(
    inputs_ids,
    max_new_tokens=40,    # Maximum length of generated translation
    do_sample=True,       # Sample from the filtered distribution (output varies between runs)
    top_k=30,            # Consider top 30 tokens for sampling
    top_p=0.95,          # Nucleus sampling threshold
)

# Decode and print the translation
print("Translation of example sentence: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
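
What `top_k` and `top_p` do to a single decoding step can be sketched on a toy distribution. This is a simplified illustration, not the library's implementation (which works on logits over the full vocabulary); the `top_k_top_p_filter` helper and the four-token distribution are made up for the example.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    # Top-k: keep the k highest-probability tokens and renormalize
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p (nucleus): add tokens in descending order until the cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution again
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"Hallo": 0.5, "Guten": 0.3, "Hi": 0.15, "Servus": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sampling then draws the next token from the filtered distribution
tokens, weights = zip(*filtered.items())
sampled = random.Random(0).choices(tokens, weights=weights, k=1)[0]
print(sampled)
```

Because the next token is sampled, two runs of the same prompt can give different translations. Setting `do_sample=False` (greedy or beam search) gives repeatable output for before/after comparisons.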


In [None]:
# Finetune the model
trainer.train()

In [None]:
# Evaluate the model on test set after fine-tuning
postfinetuned_results = trainer.evaluate(
    eval_dataset=data_dict_tokenized["test"]  # Use test split for evaluation
)

In [None]:
print("Test set bleu score after finetuning: ", postfinetuned_results["eval_bleu"])


In [None]:

# Try translating the example sentence again to see if the translation improved
test_text = "translate English to German: I am a student of English language and Linguistics. I really like working with LLMs."
inputs_ids = tokenizer(test_text, return_tensors="pt").input_ids.to(0)
outputs = t5.generate(inputs_ids, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
print("Translation of example sentence: ", tokenizer.decode(outputs[0], skip_special_tokens=True))