{ "cells": [ { "cell_type": "markdown", "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b", "metadata": { "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b" }, "source": [ "\n", "\n", "\n", "
This notebook is an adapted version of https://github.com/rasbt/LLMs-from-scratch\n", "
\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "id": "bfabadb8-5935-45ff-b39c-db7a29012129", "metadata": { "id": "bfabadb8-5935-45ff-b39c-db7a29012129" }, "source": [ "# Finetuning for Text Classification" ] }, { "cell_type": "code", "source": [ "!pip install tiktoken" ], "metadata": { "id": "9ogchkElyO0h" }, "id": "9ogchkElyO0h", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90", "metadata": { "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90" }, "outputs": [], "source": [ "from importlib.metadata import version\n", "\n", "pkgs = [\"matplotlib\",\n", " \"numpy\",\n", " \"tiktoken\",\n", " \"torch\",\n", " \"tensorflow\", # For OpenAI's pretrained weights\n", " \"pandas\" # Dataset loading\n", " ]\n", "for p in pkgs:\n", " print(f\"{p} version: {version(p)}\")" ] }, { "cell_type": "markdown", "id": "a445828a-ff10-4efa-9f60-a2e2aed4c87d", "metadata": { "id": "a445828a-ff10-4efa-9f60-a2e2aed4c87d" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "946c3e56-b04b-4b0f-b35f-b485ce5b28df", "metadata": { "id": "946c3e56-b04b-4b0f-b35f-b485ce5b28df" }, "outputs": [], "source": [ "# Utility to prevent certain cells from being executed twice\n", "\n", "from IPython.core.magic import register_line_cell_magic\n", "\n", "executed_cells = set()\n", "\n", "@register_line_cell_magic\n", "def run_once(line, cell):\n", " if line not in executed_cells:\n", " get_ipython().run_cell(cell)\n", " executed_cells.add(line)\n", " else:\n", " print(f\"Cell '{line}' has already been executed.\")" ] }, { "cell_type": "markdown", "id": "3a84cf35-b37f-4c15-8972-dfafc9fadc1c", "metadata": { "id": "3a84cf35-b37f-4c15-8972-dfafc9fadc1c" }, "source": [ "## Classification finetuning" ] }, { "cell_type": "markdown", "id": "a7f60321-95b8-46a9-97bf-1d07fda2c3dd", "metadata": { "id": "a7f60321-95b8-46a9-97bf-1d07fda2c3dd" }, "source": [ "- Large Language Models start with general knowledge from training on vast amounts of data. 
They are not specialized in specific tasks\n", "- Finetuning is like giving specialized training to the above Large Language Model\n", "- Classification finetuning teaches a model to sort inputs into specific categories\n", "- In classification finetuning, we have a specific number of class labels (for example, \"spam\" and \"not spam\") that the model can output" ] }, { "cell_type": "markdown", "id": "0b37a0c4-0bb1-4061-b1fe-eaa4416d52c3", "metadata": { "id": "0b37a0c4-0bb1-4061-b1fe-eaa4416d52c3" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf", "metadata": { "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf" }, "source": [ "## Preparing the dataset" ] }, { "cell_type": "markdown", "id": "5f628975-d2e8-4f7f-ab38-92bb868b7067", "metadata": { "id": "5f628975-d2e8-4f7f-ab38-92bb868b7067" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "9fbd459f-63fa-4d8c-8499-e23103156c7d", "metadata": { "id": "9fbd459f-63fa-4d8c-8499-e23103156c7d" }, "source": [ "- This section prepares the dataset we use for classification finetuning\n", "- We use a dataset consisting of spam and non-spam text messages to finetune the LLM to classify them\n", "- First, we download and unzip the dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "def7c09b-af9c-4216-90ce-5e67aed1065c", "metadata": { "id": "def7c09b-af9c-4216-90ce-5e67aed1065c" }, "outputs": [], "source": [ "import urllib.request\n", "import zipfile\n", "import os\n", "from pathlib import Path\n", "\n", "url = \"https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip\"\n", "zip_path = \"sms_spam_collection.zip\"\n", "extracted_path = \"sms_spam_collection\"\n", "data_file_path = Path(extracted_path) / \"SMSSpamCollection.tsv\"\n", "\n", "def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):\n", " if data_file_path.exists():\n", " print(f\"{data_file_path} already exists. 
Skipping download and extraction.\")\n", " return\n", "\n", " # Downloading the file\n", " with urllib.request.urlopen(url) as response:\n", " with open(zip_path, \"wb\") as out_file:\n", " out_file.write(response.read())\n", "\n", " # Unzipping the file\n", " with zipfile.ZipFile(zip_path, \"r\") as zip_ref:\n", " zip_ref.extractall(extracted_path)\n", "\n", " # Add .tsv file extension\n", " original_file_path = Path(extracted_path) / \"SMSSpamCollection\"\n", " os.rename(original_file_path, data_file_path)\n", " print(f\"File downloaded and saved as {data_file_path}\")\n", "\n", "download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)" ] }, { "cell_type": "markdown", "id": "6aac2d19-06d0-4005-916b-0bd4b1ee50d1", "metadata": { "id": "6aac2d19-06d0-4005-916b-0bd4b1ee50d1" }, "source": [ "- The dataset is saved as a tab-separated text file, which we can load into a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "da0ed4da-ac31-4e4d-8bdd-2153be4656a4", "metadata": { "id": "da0ed4da-ac31-4e4d-8bdd-2153be4656a4" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(data_file_path, sep=\"\\t\", header=None, names=[\"Label\", \"Text\"])\n", "df" ] }, { "cell_type": "markdown", "id": "e7b6e631-4f0b-4aab-82b9-8898e6663109", "metadata": { "id": "e7b6e631-4f0b-4aab-82b9-8898e6663109" }, "source": [ "- When we check the class distribution, we see that the data contains \"ham\" (i.e., \"not spam\") much more frequently than \"spam\"" ] }, { "cell_type": "code", "execution_count": null, "id": "495a5280-9d7c-41d4-9719-64ab99056d4c", "metadata": { "id": "495a5280-9d7c-41d4-9719-64ab99056d4c" }, "outputs": [], "source": [ "print(df[\"Label\"].value_counts())" ] }, { "cell_type": "markdown", "id": "f773f054-0bdc-4aad-bbf6-397621bf63db", "metadata": { "id": "f773f054-0bdc-4aad-bbf6-397621bf63db" }, "source": [ "- For simplicity, and because we prefer a small dataset for educational purposes anyway (it will make it possible to finetune the LLM faster), we subsample (undersample) the dataset so that it contains 747 instances from each class\n", "- (Next to undersampling, there are several other ways to deal with class balances, but they are out of the scope of a book on LLMs; you can find examples and more information in the [`imbalanced-learn` user guide](https://imbalanced-learn.org/stable/user_guide.html))" ] }, { "cell_type": "code", "execution_count": null, "id": "7be4a0a2-9704-4a96-b38f-240339818688", "metadata": { "id": "7be4a0a2-9704-4a96-b38f-240339818688" }, "outputs": [], "source": [ "%%run_once balance_df\n", "\n", "\n", "def create_balanced_dataset(df):\n", "\n", " # Count the instances of \"spam\"\n", " num_spam = df[df[\"Label\"] == \"spam\"].shape[0]\n", "\n", " # Randomly sample \"ham\" instances to match the number of \"spam\" instances\n", " ham_subset = df[df[\"Label\"] == \"ham\"].sample(num_spam, random_state=123)\n", "\n", " # Combine ham \"subset\" with \"spam\"\n", " balanced_df = pd.concat([ham_subset, df[df[\"Label\"] == \"spam\"]])\n", "\n", " return balanced_df\n", "\n", "\n", "balanced_df = create_balanced_dataset(df)\n", "print(balanced_df[\"Label\"].value_counts())" ] }, { "cell_type": "markdown", "id": "d3fd2f5a-06d8-4d30-a2e3-230b86c559d6", "metadata": { "id": "d3fd2f5a-06d8-4d30-a2e3-230b86c559d6" }, "source": [ "- Next, we change the string class labels \"ham\" and \"spam\" into integer class labels 0 and 1:" ] }, { "cell_type": "code", "execution_count": null, "id": 
"c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd", "metadata": { "id": "c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd" }, "outputs": [], "source": [ "%%run_once label_mapping\n", "balanced_df[\"Label\"] = balanced_df[\"Label\"].map({\"ham\": 0, \"spam\": 1})" ] }, { "cell_type": "code", "execution_count": null, "id": "e6f7f062-ef4e-4020-8275-71990cab4414", "metadata": { "id": "e6f7f062-ef4e-4020-8275-71990cab4414" }, "outputs": [], "source": [ "balanced_df" ] }, { "cell_type": "markdown", "id": "5715e685-35b4-4b45-a86c-8a8694de9d6f", "metadata": { "id": "5715e685-35b4-4b45-a86c-8a8694de9d6f" }, "source": [ "- Let's now define a function that randomly divides the dataset into training, validation, and test subsets" ] }, { "cell_type": "code", "execution_count": null, "id": "uQl0Psdmx15D", "metadata": { "id": "uQl0Psdmx15D" }, "outputs": [], "source": [ "def random_split(df, train_frac, validation_frac):\n", " # Shuffle the entire DataFrame\n", " df = df.sample(frac=1, random_state=123).reset_index(drop=True)\n", "\n", " # Calculate split indices\n", " train_end = int(len(df) * train_frac)\n", " validation_end = train_end + int(len(df) * validation_frac)\n", "\n", " # Split the DataFrame\n", " train_df = df[:train_end]\n", " validation_df = df[train_end:validation_end]\n", " test_df = df[validation_end:]\n", "\n", " return train_df, validation_df, test_df\n", "\n", "train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)\n", "# Test size is implied to be 0.2 as the remainder\n", "\n", "train_df.to_csv(\"train.csv\", index=None)\n", "validation_df.to_csv(\"validation.csv\", index=None)\n", "test_df.to_csv(\"test.csv\", index=None)" ] }, { "cell_type": "markdown", "id": "a8d7a0c5-1d5f-458a-b685-3f49520b0094", "metadata": { "id": "a8d7a0c5-1d5f-458a-b685-3f49520b0094" }, "source": [ "## Creating data loaders" ] }, { "cell_type": "markdown", "id": "7126108a-75e7-4862-b0fb-cbf59a18bb6c", "metadata": { "id": "7126108a-75e7-4862-b0fb-cbf59a18bb6c" }, "source": [ "- Note that the text messages have different lengths; if we want to combine multiple training examples in a batch, we have to either\n", " 1. truncate all messages to the length of the shortest message in the dataset or batch\n", " 2. 
pad all messages to the length of the longest message in the dataset or batch\n", "\n", "- We choose option 2 and pad all messages to the longest message in the dataset\n", "- For that, we use `<|endoftext|>` as a padding token" ] }, { "cell_type": "markdown", "id": "0829f33f-1428-4f22-9886-7fee633b3666", "metadata": { "id": "0829f33f-1428-4f22-9886-7fee633b3666" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7", "metadata": { "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7" }, "outputs": [], "source": [ "import tiktoken\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "print(tokenizer.encode(\"<|endoftext|>\", allowed_special={\"<|endoftext|>\"}))" ] }, { "cell_type": "markdown", "id": "04f582ff-68bf-450e-bd87-5fb61afe431c", "metadata": { "id": "04f582ff-68bf-450e-bd87-5fb61afe431c" }, "source": [ "- The `SpamDataset` class below identifies the longest sequence in the training dataset and adds the padding token to the others to match that sequence length" ] }, { "cell_type": "code", "execution_count": null, "id": "d7791b52-af18-4ac4-afa9-b921068e383e", "metadata": { "id": "d7791b52-af18-4ac4-afa9-b921068e383e" }, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import Dataset\n", "\n", "\n", "class SpamDataset(Dataset):\n", " def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):\n", " self.data = pd.read_csv(csv_file)\n", "\n", " # Pre-tokenize texts\n", " self.encoded_texts = [\n", " tokenizer.encode(text) for text in self.data[\"Text\"]\n", " ]\n", "\n", " if max_length is None:\n", " self.max_length = self._longest_encoded_length()\n", " else:\n", " self.max_length = max_length\n", " # Truncate sequences if they are longer than max_length\n", " self.encoded_texts = [\n", " encoded_text[:self.max_length]\n", " for encoded_text in self.encoded_texts\n", " ]\n", "\n", " # Pad sequences to the longest sequence\n", " self.encoded_texts = [\n", " encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))\n", " for encoded_text in self.encoded_texts\n", " ]\n", "\n", " def __getitem__(self, index):\n", " encoded = self.encoded_texts[index]\n", " label = self.data.iloc[index][\"Label\"]\n", " return (\n", " torch.tensor(encoded, dtype=torch.long),\n", " torch.tensor(label, dtype=torch.long)\n", " )\n", "\n", " def __len__(self):\n", " return len(self.data)\n", "\n", " def _longest_encoded_length(self):\n", " max_length = 0\n", " for encoded_text in self.encoded_texts:\n", " encoded_length = len(encoded_text)\n", " if encoded_length > max_length:\n", " max_length = encoded_length\n", " return max_length" ] }, { "cell_type": "code", "execution_count": null, "id": "uzj85f8ou82h", "metadata": { "id": "uzj85f8ou82h" }, "outputs": [], "source": [ "train_dataset = SpamDataset(\n", " csv_file=\"train.csv\",\n", " max_length=None,\n", " tokenizer=tokenizer\n", ")\n", "\n", "print(train_dataset.max_length)" ] }, { "cell_type": "markdown", "id": "15bdd932-97eb-4b88-9cf9-d766ea4c3a60", "metadata": { "id": "15bdd932-97eb-4b88-9cf9-d766ea4c3a60" }, "source": [ "- We also pad the validation and test set to the longest training sequence\n", "- Note that validation and test set samples that are longer than the longest training example are being truncated via `encoded_text[:self.max_length]` in the `SpamDataset` code\n", "- This behavior is entirely optional, and it would also work well if we set `max_length=None` in both the validation and test set cases" ] }, { "cell_type": "code", 
"execution_count": null, "id": "bb0c502d-a75e-4248-8ea0-196e2b00c61e", "metadata": { "id": "bb0c502d-a75e-4248-8ea0-196e2b00c61e" }, "outputs": [], "source": [ "val_dataset = SpamDataset(\n", " csv_file=\"validation.csv\",\n", " max_length=train_dataset.max_length,\n", " tokenizer=tokenizer\n", ")\n", "test_dataset = SpamDataset(\n", " csv_file=\"test.csv\",\n", " max_length=train_dataset.max_length,\n", " tokenizer=tokenizer\n", ")" ] }, { "cell_type": "markdown", "id": "20170d89-85a0-4844-9887-832f5d23432a", "metadata": { "id": "20170d89-85a0-4844-9887-832f5d23432a" }, "source": [ "- Next, we use the dataset to instantiate the data loaders, which is similar to creating the data loaders" ] }, { "cell_type": "markdown", "id": "64bcc349-205f-48f8-9655-95ff21f5e72f", "metadata": { "id": "64bcc349-205f-48f8-9655-95ff21f5e72f" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542", "metadata": { "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542" }, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "\n", "num_workers = 0\n", "batch_size = 8\n", "\n", "torch.manual_seed(123)\n", "\n", "train_loader = DataLoader(\n", " dataset=train_dataset,\n", " batch_size=batch_size,\n", " shuffle=True,\n", " num_workers=num_workers,\n", " drop_last=True,\n", ")\n", "\n", "val_loader = DataLoader(\n", " dataset=val_dataset,\n", " batch_size=batch_size,\n", " num_workers=num_workers,\n", " drop_last=False,\n", ")\n", "\n", "test_loader = DataLoader(\n", " dataset=test_dataset,\n", " batch_size=batch_size,\n", " num_workers=num_workers,\n", " drop_last=False,\n", ")" ] }, { "cell_type": "markdown", "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57", "metadata": { "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57" }, "source": [ "- As a verification step, we iterate through the data loaders and ensure that the batches contain 8 training examples each, where each training example consists of 120 tokens" ] }, { "cell_type": "code", "execution_count": null, "id": "4dee6882-4c3a-4964-af15-fa31f86ad047", "metadata": { "id": "4dee6882-4c3a-4964-af15-fa31f86ad047" }, "outputs": [], "source": [ "print(\"Train loader:\")\n", "for input_batch, target_batch in train_loader:\n", " pass\n", "\n", "print(\"Input batch dimensions:\", input_batch.shape)\n", "print(\"Label batch dimensions\", target_batch.shape)" ] }, { "cell_type": "markdown", "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1", "metadata": { "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1" }, "source": [ "- Lastly, let's print the total number of batches in each dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "IZfw-TYD2zTj", "metadata": { "id": "IZfw-TYD2zTj" }, "outputs": [], "source": [ "print(f\"{len(train_loader)} training batches\")\n", "print(f\"{len(val_loader)} validation batches\")\n", "print(f\"{len(test_loader)} test batches\")" ] }, { "cell_type": "markdown", "id": "d1c4f61a-5f5d-4b3b-97cf-151b617d1d6c", "metadata": { "id": "d1c4f61a-5f5d-4b3b-97cf-151b617d1d6c" }, "source": [ "## Initializing a model with pretrained weights" ] }, { "cell_type": "markdown", "id": "97e1af8b-8bd1-4b44-8b8b-dc031496e208", "metadata": { "id": "97e1af8b-8bd1-4b44-8b8b-dc031496e208" }, "source": [ "- In this section, we initialize the pretrained model\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "2992d779-f9fb-4812-a117-553eb790a5a9", "metadata": { "id": "2992d779-f9fb-4812-a117-553eb790a5a9" }, "outputs": [], "source": [ "CHOOSE_MODEL = \"gpt2-small (124M)\"\n", 
"INPUT_PROMPT = \"Every effort moves\"\n", "\n", "BASE_CONFIG = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 1024, # Context length\n", " \"drop_rate\": 0.0, # Dropout rate\n", " \"qkv_bias\": True # Query-key-value bias\n", "}\n", "\n", "model_configs = {\n", " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", "}\n", "\n", "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n", "\n", "assert train_dataset.max_length <= BASE_CONFIG[\"context_length\"], (\n", " f\"Dataset length {train_dataset.max_length} exceeds model's context \"\n", " f\"length {BASE_CONFIG['context_length']}. Reinitialize data sets with \"\n", " f\"`max_length={BASE_CONFIG['context_length']}`\"\n", ")" ] }, { "cell_type": "code", "source": [ "def download_and_load_gpt2(model_size, models_dir):\n", " # Validate model size\n", " allowed_sizes = (\"124M\", \"355M\", \"774M\", \"1558M\")\n", " if model_size not in allowed_sizes:\n", " raise ValueError(f\"Model size not in {allowed_sizes}\")\n", "\n", " # Define paths\n", " model_dir = os.path.join(models_dir, model_size)\n", " base_url = \"https://openaipublic.blob.core.windows.net/gpt-2/models\"\n", " filenames = [\n", " \"checkpoint\", \"encoder.json\", \"hparams.json\",\n", " \"model.ckpt.data-00000-of-00001\", \"model.ckpt.index\",\n", " \"model.ckpt.meta\", \"vocab.bpe\"\n", " ]\n", "\n", " # Download files\n", " os.makedirs(model_dir, exist_ok=True)\n", " for filename in filenames:\n", " file_url = os.path.join(base_url, model_size, filename)\n", " file_path = os.path.join(model_dir, filename)\n", " download_file(file_url, file_path)\n", "\n", " # Load settings and params\n", " tf_ckpt_path = tf.train.latest_checkpoint(model_dir)\n", " settings = json.load(open(os.path.join(model_dir, \"hparams.json\")))\n", " params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings)\n", "\n", " return settings, params\n", "\n", "\n", "def download_file(url, destination):\n", " # Send a GET request to download the file\n", "\n", " try:\n", " with urllib.request.urlopen(url) as response:\n", " # Get the total file size from headers, defaulting to 0 if not present\n", " file_size = int(response.headers.get(\"Content-Length\", 0))\n", "\n", " # Check if file exists and has the same size\n", " if os.path.exists(destination):\n", " file_size_local = os.path.getsize(destination)\n", " if file_size == file_size_local:\n", " print(f\"File already exists and is up-to-date: {destination}\")\n", " return\n", "\n", " # Define the block size for reading the file\n", " block_size = 1024 # 1 Kilobyte\n", "\n", " # Initialize the progress bar with total file size\n", " progress_bar_description = os.path.basename(url) # Extract filename from URL\n", " with tqdm(total=file_size, unit=\"iB\", unit_scale=True, desc=progress_bar_description) as progress_bar:\n", " # Open the destination file in binary write mode\n", " with open(destination, \"wb\") as file:\n", " # Read the file in chunks and write to destination\n", " while True:\n", " chunk = response.read(block_size)\n", " if not chunk:\n", " break\n", " file.write(chunk)\n", " progress_bar.update(len(chunk)) # Update progress bar\n", " except urllib.error.HTTPError:\n", " s = (\n", " f\"The specified URL ({url}) is incorrect, the internet 
connection cannot be established,\"\n", " \"\\nor the requested file is temporarily unavailable.\\nPlease visit the following website\"\n", " \" for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273\")\n", " print(s)\n", "\n", "\n", "def load_gpt2_params_from_tf_ckpt(ckpt_path, settings):\n", " # Initialize parameters dictionary with empty blocks for each layer\n", " params = {\"blocks\": [{} for _ in range(settings[\"n_layer\"])]}\n", "\n", " # Iterate over each variable in the checkpoint\n", " for name, _ in tf.train.list_variables(ckpt_path):\n", " # Load the variable and remove singleton dimensions\n", " variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))\n", "\n", " # Process the variable name to extract relevant parts\n", " variable_name_parts = name.split(\"/\")[1:] # Skip the 'model/' prefix\n", "\n", " # Identify the target dictionary for the variable\n", " target_dict = params\n", " if variable_name_parts[0].startswith(\"h\"):\n", " layer_number = int(variable_name_parts[0][1:])\n", " target_dict = params[\"blocks\"][layer_number]\n", "\n", " # Recursively access or create nested dictionaries\n", " for key in variable_name_parts[1:-1]:\n", " target_dict = target_dict.setdefault(key, {})\n", "\n", " # Assign the variable array to the last key\n", " last_key = variable_name_parts[-1]\n", " target_dict[last_key] = variable_array\n", "\n", " return params" ], "metadata": { "id": "c9fj2xCc04Ih" }, "id": "c9fj2xCc04Ih", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import torch.nn as nn\n", "from tqdm import tqdm\n", "import tensorflow as tf\n", "import json\n", "import numpy as np" ], "metadata": { "id": "DQS0UDx41BMv" }, "id": "DQS0UDx41BMv", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Building the GPT model\n", "\n", "(same model as the one in gpt2.ipynb)" ], "metadata": { "id": "9usC7F_iALtQ" }, "id": "9usC7F_iALtQ" }, { "cell_type": "code", "source": [ "class MultiHeadAttention(nn.Module):\n", " def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):\n", " super().__init__()\n", " assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n", "\n", " self.d_out = d_out\n", " self.num_heads = num_heads\n", " self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n", "\n", " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs\n", " self.dropout = nn.Dropout(dropout)\n", " self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))\n", "\n", " def forward(self, x):\n", " b, num_tokens, d_in = x.shape\n", "\n", " keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n", " queries = self.W_query(x)\n", " values = self.W_value(x)\n", "\n", " # We implicitly split the matrix by adding a `num_heads` dimension\n", " # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n", " keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)\n", " values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n", " queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n", "\n", " # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n", " keys = keys.transpose(1, 2)\n", " queries = 
queries.transpose(1, 2)\n", " values = values.transpose(1, 2)\n", "\n", " # Compute scaled dot-product attention (aka self-attention) with a causal mask\n", " attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head\n", "\n", " # Original mask truncated to the number of tokens and converted to boolean\n", " mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n", "\n", " # Use the mask to fill attention scores\n", " attn_scores.masked_fill_(mask_bool, -torch.inf)\n", "\n", " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", " attn_weights = self.dropout(attn_weights)\n", "\n", " # Shape: (b, num_tokens, num_heads, head_dim)\n", " context_vec = (attn_weights @ values).transpose(1, 2)\n", "\n", " # Combine heads, where self.d_out = self.num_heads * self.head_dim\n", " context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n", " context_vec = self.out_proj(context_vec) # optional projection\n", "\n", " return context_vec\n", "\n", "class GELU(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", "\n", " def forward(self, x):\n", " return 0.5 * x * (1 + torch.tanh(\n", " torch.sqrt(torch.tensor(2.0 / torch.pi)) *\n", " (x + 0.044715 * torch.pow(x, 3))\n", " ))\n", "\n", "class FeedForward(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.layers = nn.Sequential(\n", " nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n", " GELU(),\n", " nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n", " )\n", "\n", " def forward(self, x):\n", " return self.layers(x)\n", "\n", "class LayerNorm(nn.Module):\n", " def __init__(self, emb_dim):\n", " super().__init__()\n", " self.eps = 1e-5\n", " self.scale = nn.Parameter(torch.ones(emb_dim))\n", " self.shift = nn.Parameter(torch.zeros(emb_dim))\n", "\n", " def forward(self, x):\n", " mean = x.mean(dim=-1, keepdim=True)\n", " var = x.var(dim=-1, keepdim=True, unbiased=False)\n", " norm_x = (x - mean) / torch.sqrt(var + self.eps)\n", " return self.scale * norm_x + self.shift\n", "\n", "\n", "class TransformerBlock(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.att = MultiHeadAttention(\n", " d_in=cfg[\"emb_dim\"],\n", " d_out=cfg[\"emb_dim\"],\n", " context_length=cfg[\"context_length\"],\n", " num_heads=cfg[\"n_heads\"],\n", " dropout=cfg[\"drop_rate\"],\n", " qkv_bias=cfg[\"qkv_bias\"])\n", " self.ff = FeedForward(cfg)\n", " self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n", " self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n", " self.drop_shortcut = nn.Dropout(cfg[\"drop_rate\"])\n", "\n", " def forward(self, x):\n", " # Shortcut connection for attention block\n", " shortcut = x\n", " x = self.norm1(x)\n", " x = self.att(x) # Shape [batch_size, num_tokens, emb_size]\n", " x = self.drop_shortcut(x)\n", " x = x + shortcut # Add the original input back\n", "\n", " # Shortcut connection for feed forward block\n", " shortcut = x\n", " x = self.norm2(x)\n", " x = self.ff(x)\n", " x = self.drop_shortcut(x)\n", " x = x + shortcut # Add the original input back\n", "\n", " return x" ], "metadata": { "id": "oIVPE9Tg1YOz" }, "id": "oIVPE9Tg1YOz", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "class GPTModel(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n", " self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n", " self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n", "\n", " self.trf_blocks = nn.Sequential(\n", 
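"            # Stack cfg[\"n_layers\"] identical TransformerBlock modules; nn.Sequential\n", "            # applies them one after another in a single call inside forward()\n",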
" *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n", "\n", " self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n", " self.out_head = nn.Linear(\n", " cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n", " )\n", "\n", " def forward(self, in_idx):\n", " batch_size, seq_len = in_idx.shape\n", " tok_embeds = self.tok_emb(in_idx)\n", " pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n", " x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]\n", " x = self.drop_emb(x)\n", " x = self.trf_blocks(x)\n", " x = self.final_norm(x)\n", " logits = self.out_head(x)\n", " return logits" ], "metadata": { "id": "viShjjIF068N" }, "id": "viShjjIF068N", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def assign(left, right):\n", " if left.shape != right.shape:\n", " raise ValueError(f\"Shape mismatch. Left: {left.shape}, Right: {right.shape}\")\n", " return torch.nn.Parameter(torch.tensor(right))\n", "\n", "def load_weights_into_gpt(gpt, params):\n", " gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])\n", " gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])\n", "\n", " for b in range(len(params[\"blocks\"])):\n", " q_w, k_w, v_w = np.split(\n", " (params[\"blocks\"][b][\"attn\"][\"c_attn\"])[\"w\"], 3, axis=-1)\n", " gpt.trf_blocks[b].att.W_query.weight = assign(\n", " gpt.trf_blocks[b].att.W_query.weight, q_w.T)\n", " gpt.trf_blocks[b].att.W_key.weight = assign(\n", " gpt.trf_blocks[b].att.W_key.weight, k_w.T)\n", " gpt.trf_blocks[b].att.W_value.weight = assign(\n", " gpt.trf_blocks[b].att.W_value.weight, v_w.T)\n", "\n", " q_b, k_b, v_b = np.split(\n", " (params[\"blocks\"][b][\"attn\"][\"c_attn\"])[\"b\"], 3, axis=-1)\n", " gpt.trf_blocks[b].att.W_query.bias = assign(\n", " gpt.trf_blocks[b].att.W_query.bias, q_b)\n", " gpt.trf_blocks[b].att.W_key.bias = assign(\n", " gpt.trf_blocks[b].att.W_key.bias, k_b)\n", " gpt.trf_blocks[b].att.W_value.bias = assign(\n", " gpt.trf_blocks[b].att.W_value.bias, v_b)\n", "\n", " gpt.trf_blocks[b].att.out_proj.weight = assign(\n", " gpt.trf_blocks[b].att.out_proj.weight,\n", " params[\"blocks\"][b][\"attn\"][\"c_proj\"][\"w\"].T)\n", " gpt.trf_blocks[b].att.out_proj.bias = assign(\n", " gpt.trf_blocks[b].att.out_proj.bias,\n", " params[\"blocks\"][b][\"attn\"][\"c_proj\"][\"b\"])\n", "\n", " gpt.trf_blocks[b].ff.layers[0].weight = assign(\n", " gpt.trf_blocks[b].ff.layers[0].weight,\n", " params[\"blocks\"][b][\"mlp\"][\"c_fc\"][\"w\"].T)\n", " gpt.trf_blocks[b].ff.layers[0].bias = assign(\n", " gpt.trf_blocks[b].ff.layers[0].bias,\n", " params[\"blocks\"][b][\"mlp\"][\"c_fc\"][\"b\"])\n", " gpt.trf_blocks[b].ff.layers[2].weight = assign(\n", " gpt.trf_blocks[b].ff.layers[2].weight,\n", " params[\"blocks\"][b][\"mlp\"][\"c_proj\"][\"w\"].T)\n", " gpt.trf_blocks[b].ff.layers[2].bias = assign(\n", " gpt.trf_blocks[b].ff.layers[2].bias,\n", " params[\"blocks\"][b][\"mlp\"][\"c_proj\"][\"b\"])\n", "\n", " gpt.trf_blocks[b].norm1.scale = assign(\n", " gpt.trf_blocks[b].norm1.scale,\n", " params[\"blocks\"][b][\"ln_1\"][\"g\"])\n", " gpt.trf_blocks[b].norm1.shift = assign(\n", " gpt.trf_blocks[b].norm1.shift,\n", " params[\"blocks\"][b][\"ln_1\"][\"b\"])\n", " gpt.trf_blocks[b].norm2.scale = assign(\n", " gpt.trf_blocks[b].norm2.scale,\n", " params[\"blocks\"][b][\"ln_2\"][\"g\"])\n", " gpt.trf_blocks[b].norm2.shift = assign(\n", " gpt.trf_blocks[b].norm2.shift,\n", " params[\"blocks\"][b][\"ln_2\"][\"b\"])\n", "\n", " gpt.final_norm.scale = assign(gpt.final_norm.scale, 
params[\"g\"])\n", " gpt.final_norm.shift = assign(gpt.final_norm.shift, params[\"b\"])\n", " gpt.out_head.weight = assign(gpt.out_head.weight, params[\"wte\"])\n" ], "metadata": { "id": "6MN5g7ah1rhB" }, "id": "6MN5g7ah1rhB", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "022a649a-44f5-466c-8a8e-326c063384f5", "metadata": { "id": "022a649a-44f5-466c-8a8e-326c063384f5" }, "outputs": [], "source": [ "model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n", "settings, params = download_and_load_gpt2(model_size=model_size, models_dir=\"gpt2\")\n", "\n", "model = GPTModel(BASE_CONFIG)\n", "load_weights_into_gpt(model, params)\n", "model.eval();" ] }, { "cell_type": "markdown", "id": "ab8e056c-abe0-415f-b34d-df686204259e", "metadata": { "id": "ab8e056c-abe0-415f-b34d-df686204259e" }, "source": [ "- To ensure that the model was loaded correctly, let's double-check that it generates coherent text" ] }, { "cell_type": "code", "execution_count": null, "id": "d8ac25ff-74b1-4149-8dc5-4c429d464330", "metadata": { "id": "d8ac25ff-74b1-4149-8dc5-4c429d464330" }, "outputs": [], "source": [ "def generate_text_simple(model, idx, max_new_tokens, context_size):\n", " # idx is (batch, n_tokens) array of indices in the current context\n", " for _ in range(max_new_tokens):\n", "\n", " # Crop current context if it exceeds the supported context size\n", " # E.g., if LLM supports only 5 tokens, and the context size is 10\n", " # then only the last 5 tokens are used as context\n", " idx_cond = idx[:, -context_size:]\n", "\n", " # Get the predictions\n", " with torch.no_grad():\n", " logits = model(idx_cond)\n", "\n", " # Focus only on the last time step\n", " # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n", " logits = logits[:, -1, :]\n", "\n", " # Apply softmax to get probabilities\n", " probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)\n", "\n", " # Get the idx of the vocab entry with the highest probability value\n", " idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)\n", "\n", " # Append sampled index to the running sequence\n", " idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)\n", "\n", " return idx\n", "\n", "def text_to_token_ids(text, tokenizer):\n", " encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})\n", " encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension\n", " return encoded_tensor\n", "\n", "def token_ids_to_text(token_ids, tokenizer):\n", " flat = token_ids.squeeze(0) # remove batch dimension\n", " return tokenizer.decode(flat.tolist())\n", "\n", "\n", "text_1 = \"Every effort moves you\"\n", "\n", "token_ids = generate_text_simple(\n", " model=model,\n", " idx=text_to_token_ids(text_1, tokenizer),\n", " max_new_tokens=15,\n", " context_size=BASE_CONFIG[\"context_length\"]\n", ")\n", "\n", "print(token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "69162550-6a02-4ece-8db1-06c71d61946f", "metadata": { "id": "69162550-6a02-4ece-8db1-06c71d61946f" }, "source": [ "- Before we finetune the model as a classifier, let's see if the model can perhaps already classify spam messages via prompting" ] }, { "cell_type": "code", "execution_count": null, "id": "94224aa9-c95a-4f8a-a420-76d01e3a800c", "metadata": { "id": "94224aa9-c95a-4f8a-a420-76d01e3a800c" }, "outputs": [], "source": [ "text_2 = (\n", " \"Is the following text 'spam'? 
Answer with 'yes' or 'no':\"\n", " \" 'You are a winner you have been specially\"\n", " \" selected to receive $1000 cash or a $2000 award.'\"\n", ")\n", "\n", "token_ids = generate_text_simple(\n", " model=model,\n", " idx=text_to_token_ids(text_2, tokenizer),\n", " max_new_tokens=23,\n", " context_size=BASE_CONFIG[\"context_length\"]\n", ")\n", "\n", "print(token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "1ce39ed0-2c77-410d-8392-dd15d4b22016", "metadata": { "id": "1ce39ed0-2c77-410d-8392-dd15d4b22016" }, "source": [ "- As we can see, the model is not very good at following instructions\n", "- This is expected, since it has only been pretrained" ] }, { "cell_type": "markdown", "id": "4c9ae440-32f9-412f-96cf-fd52cc3e2522", "metadata": { "id": "4c9ae440-32f9-412f-96cf-fd52cc3e2522" }, "source": [ "## Adding a classification head" ] }, { "cell_type": "markdown", "id": "d6e9d66f-76b2-40fc-9ec5-3f972a8db9c0", "metadata": { "id": "d6e9d66f-76b2-40fc-9ec5-3f972a8db9c0" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "217bac05-78df-4412-bd80-612f8061c01d", "metadata": { "id": "217bac05-78df-4412-bd80-612f8061c01d" }, "source": [ "- In this section, we are modifying the pretrained LLM to make it ready for classification finetuning\n", "- Let's take a look at the model architecture first" ] }, { "cell_type": "code", "execution_count": null, "id": "b23aff91-6bd0-48da-88f6-353657e6c981", "metadata": { "id": "b23aff91-6bd0-48da-88f6-353657e6c981" }, "outputs": [], "source": [ "print(model)" ] }, { "cell_type": "markdown", "id": "3f640a76-dd00-4769-9bc8-1aed0cec330d", "metadata": { "id": "3f640a76-dd00-4769-9bc8-1aed0cec330d" }, "source": [ "- Above, we can see the architecture we implemented in gpt2.ipynb (from previous class) neatly laid out\n", "- The goal is to replace and finetune the output layer\n", "- To achieve this, we first freeze the model, meaning that we make all layers non-trainable" ] }, { "cell_type": "code", "execution_count": null, "id": "fkMWFl-0etea", "metadata": { "id": "fkMWFl-0etea" }, "outputs": [], "source": [ "for param in model.parameters():\n", " param.requires_grad = False" ] }, { "cell_type": "markdown", "id": "72155f83-87d9-476a-a978-a15aa2d44147", "metadata": { "id": "72155f83-87d9-476a-a978-a15aa2d44147" }, "source": [ "- Then, we replace the output layer (`model.out_head`), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary)\n", "- Since we finetune the model for binary classification (predicting 2 classes, \"spam\" and \"not spam\"), we can replace the output layer as shown below, which will be trainable by default\n", "- Note that we use `BASE_CONFIG[\"emb_dim\"]` (which is equal to 768 in the `\"gpt2-small (124M)\"` model) to keep the code below more general" ] }, { "cell_type": "code", "execution_count": null, "id": "7e759fa0-0f69-41be-b576-17e5f20e04cb", "metadata": { "id": "7e759fa0-0f69-41be-b576-17e5f20e04cb" }, "outputs": [], "source": [ "torch.manual_seed(123)\n", "\n", "num_classes = 2\n", "model.out_head = torch.nn.Linear(in_features=BASE_CONFIG[\"emb_dim\"], out_features=num_classes)" ] }, { "cell_type": "markdown", "id": "30be5475-ae77-4f97-8f3e-dec462b1339f", "metadata": { "id": "30be5475-ae77-4f97-8f3e-dec462b1339f" }, "source": [ "- Technically, it's sufficient to only train the output layer\n", "- Experiments in [Finetuning Large Language Models](https://magazine.sebastianraschka.com/p/finetuning-large-language-models) show that finetuning additional layers can noticeably 
improve the performance\n", "- So, we are also making the last transformer block and the final `LayerNorm` module connecting the last transformer block to the output layer trainable" ] }, { "cell_type": "markdown", "id": "0be7c1eb-c46c-4065-8525-eea1b8c66d10", "metadata": { "id": "0be7c1eb-c46c-4065-8525-eea1b8c66d10" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7", "metadata": { "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7" }, "outputs": [], "source": [ "for param in model.trf_blocks[-1].parameters():\n", " param.requires_grad = True\n", "\n", "for param in model.final_norm.parameters():\n", " param.requires_grad = True" ] }, { "cell_type": "markdown", "id": "f012b899-8284-4d3a-97c0-8a48eb33ba2e", "metadata": { "id": "f012b899-8284-4d3a-97c0-8a48eb33ba2e" }, "source": [ "- We can still use this model similar to before in gpt2.ipynb (from previous class)\n", "- For example, let's feed it some text input" ] }, { "cell_type": "code", "execution_count": null, "id": "f645c06a-7df6-451c-ad3f-eafb18224ebc", "metadata": { "id": "f645c06a-7df6-451c-ad3f-eafb18224ebc" }, "outputs": [], "source": [ "inputs = tokenizer.encode(\"Do you have time\")\n", "inputs = torch.tensor(inputs).unsqueeze(0)\n", "print(\"Inputs:\", inputs)\n", "print(\"Inputs dimensions:\", inputs.shape) # shape: (batch_size, num_tokens)" ] }, { "cell_type": "code", "execution_count": null, "id": "48dc84f1-85cc-4609-9cee-94ff539f00f4", "metadata": { "id": "48dc84f1-85cc-4609-9cee-94ff539f00f4" }, "outputs": [], "source": [ "with torch.no_grad():\n", " outputs = model(inputs)\n", "\n", "print(\"Outputs:\\n\", outputs)\n", "print(\"Outputs dimensions:\", outputs.shape) # shape: (batch_size, num_tokens, num_classes)" ] }, { "cell_type": "markdown", "id": "75430a01-ef9c-426a-aca0-664689c4f461", "metadata": { "id": "75430a01-ef9c-426a-aca0-664689c4f461" }, "source": [ "- For each input token, there's one output vector\n", "- Since we fed the model a text sample with 4 input tokens, the output consists of 4 2-dimensional output vectors above" ] }, { "cell_type": "markdown", "id": "7df9144f-6817-4be4-8d4b-5d4dadfe4a9b", "metadata": { "id": "7df9144f-6817-4be4-8d4b-5d4dadfe4a9b" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "e3bb8616-c791-4f5c-bac0-5302f663e46a", "metadata": { "id": "e3bb8616-c791-4f5c-bac0-5302f663e46a" }, "source": [ "- We have learnt about the attention mechanism, which connects each input token to each other input token\n", "- We then also introduced the causal attention mask that is used in GPT-like models; this causal mask lets a current token only attend to the current and previous token positions\n", "- Based on this causal attention mechanism, the 4th (last) token contains the most information among all tokens because it's the only token that includes information about all other tokens\n", "- Hence, we are particularly interested in this last token, which we will finetune for the spam classification task" ] }, { "cell_type": "code", "execution_count": null, "id": "49383a8c-41d5-4dab-98f1-238bca0c2ed7", "metadata": { "id": "49383a8c-41d5-4dab-98f1-238bca0c2ed7" }, "outputs": [], "source": [ "print(\"Last output token:\", outputs[:, -1, :])" ] }, { "cell_type": "markdown", "id": "8df08ae0-e664-4670-b7c5-8a2280d9b41b", "metadata": { "id": "8df08ae0-e664-4670-b7c5-8a2280d9b41b" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "32aa4aef-e1e9-491b-9adf-5aa973e59b8c", "metadata": { "id": "32aa4aef-e1e9-491b-9adf-5aa973e59b8c" }, 
"source": [ "## Calculating the classification loss and accuracy" ] }, { "cell_type": "markdown", "id": "669e1fd1-ace8-44b4-b438-185ed0ba8b33", "metadata": { "id": "669e1fd1-ace8-44b4-b438-185ed0ba8b33" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "7a7df4ee-0a34-4a4d-896d-affbbf81e0b3", "metadata": { "id": "7a7df4ee-0a34-4a4d-896d-affbbf81e0b3" }, "source": [ "- Before explaining the loss calculation, let's have a brief look at how the model outputs are turned into class labels" ] }, { "cell_type": "markdown", "id": "557996dd-4c6b-49c4-ab83-f60ef7e1d69e", "metadata": { "id": "557996dd-4c6b-49c4-ab83-f60ef7e1d69e" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "c77faab1-3461-4118-866a-6171f2b89aa0", "metadata": { "id": "c77faab1-3461-4118-866a-6171f2b89aa0" }, "outputs": [], "source": [ "print(\"Last output token:\", outputs[:, -1, :])" ] }, { "cell_type": "markdown", "id": "7edd71fa-628a-4d00-b81d-6d8bcb2c341d", "metadata": { "id": "7edd71fa-628a-4d00-b81d-6d8bcb2c341d" }, "source": [ "- We convert the outputs (logits) into probability scores via the `softmax` function and then obtain the index position of the largest probability value via the `argmax` function" ] }, { "cell_type": "code", "execution_count": null, "id": "b81efa92-9be1-4b9e-8790-ce1fc7b17f01", "metadata": { "id": "b81efa92-9be1-4b9e-8790-ce1fc7b17f01" }, "outputs": [], "source": [ "probas = torch.softmax(outputs[:, -1, :], dim=-1)\n", "label = torch.argmax(probas)\n", "print(\"Class label:\", label.item())" ] }, { "cell_type": "markdown", "id": "414a6f02-307e-4147-a416-14d115bf8179", "metadata": { "id": "414a6f02-307e-4147-a416-14d115bf8179" }, "source": [ "- Note that the softmax function is optional here, because the largest outputs correspond to the largest probability scores" ] }, { "cell_type": "code", "execution_count": null, "id": "f9f9ad66-4969-4501-8239-3ccdb37e71a2", "metadata": { "id": "f9f9ad66-4969-4501-8239-3ccdb37e71a2" }, "outputs": [], "source": [ "logits = outputs[:, -1, :]\n", "label = torch.argmax(logits)\n", "print(\"Class label:\", label.item())" ] }, { "cell_type": "markdown", "id": "dcb20d3a-cbba-4ab1-8584-d94e16589505", "metadata": { "id": "dcb20d3a-cbba-4ab1-8584-d94e16589505" }, "source": [ "- We can apply this concept to calculate the so-called classification accuracy, which computes the percentage of correct predictions in a given dataset\n", "- To calculate the classification accuracy, we can apply the preceding `argmax`-based prediction code to all examples in a dataset and calculate the fraction of correct predictions as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "3ecf9572-aed0-4a21-9c3b-7f9f2aec5f23", "metadata": { "id": "3ecf9572-aed0-4a21-9c3b-7f9f2aec5f23" }, "outputs": [], "source": [ "def calc_accuracy_loader(data_loader, model, device, num_batches=None):\n", " model.eval()\n", " correct_predictions, num_examples = 0, 0\n", "\n", " if num_batches is None:\n", " num_batches = len(data_loader)\n", " else:\n", " num_batches = min(num_batches, len(data_loader))\n", " for i, (input_batch, target_batch) in enumerate(data_loader):\n", " if i < num_batches:\n", " input_batch, target_batch = input_batch.to(device), target_batch.to(device)\n", "\n", " with torch.no_grad():\n", " logits = model(input_batch)[:, -1, :] # Logits of last output token\n", " predicted_labels = torch.argmax(logits, dim=-1)\n", "\n", " num_examples += predicted_labels.shape[0]\n", " correct_predictions += (predicted_labels == target_batch).sum().item()\n", 
" else:\n", " break\n", " return correct_predictions / num_examples" ] }, { "cell_type": "markdown", "id": "7165fe46-a284-410b-957f-7524877d1a1a", "metadata": { "id": "7165fe46-a284-410b-957f-7524877d1a1a" }, "source": [ "- Let's apply the function to calculate the classification accuracies for the different datasets:" ] }, { "cell_type": "code", "execution_count": null, "id": "390e5255-8427-488c-adef-e1c10ab4fb26", "metadata": { "id": "390e5255-8427-488c-adef-e1c10ab4fb26" }, "outputs": [], "source": [ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "# Note:\n", "# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,\n", "# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).\n", "# As of this writing, in PyTorch 2.4, the results obtained via CPU and MPS were identical.\n", "# However, in earlier versions of PyTorch, you may observe different results when using MPS.\n", "\n", "#if torch.cuda.is_available():\n", "# device = torch.device(\"cuda\")\n", "#elif torch.backends.mps.is_available():\n", "# device = torch.device(\"mps\")\n", "#else:\n", "# device = torch.device(\"cpu\")\n", "#print(f\"Running on {device} device.\")\n", "\n", "model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes\n", "\n", "torch.manual_seed(123) # For reproducibility due to the shuffling in the training data loader\n", "\n", "train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n", "val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n", "test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n", "\n", "print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n", "print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", "print(f\"Test accuracy: {test_accuracy*100:.2f}%\")" ] }, { "cell_type": "markdown", "id": "30345e2a-afed-4d22-9486-f4010f90a871", "metadata": { "id": "30345e2a-afed-4d22-9486-f4010f90a871" }, "source": [ "- As we can see, the prediction accuracies are not very good, since we haven't finetuned the model, yet" ] }, { "cell_type": "markdown", "id": "4f4a9d15-8fc7-48a2-8734-d92a2f265328", "metadata": { "id": "4f4a9d15-8fc7-48a2-8734-d92a2f265328" }, "source": [ "- Before we can start finetuning (/training), we first have to define the loss function we want to optimize during training\n", "- The goal is to maximize the spam classification accuracy of the model; however, classification accuracy is not a differentiable function\n", "- Hence, instead, we minimize the cross-entropy loss as a proxy for maximizing the classification accuracy\n", "- Here, we are only interested in optimizing the last token `model(input_batch)[:, -1, :]` instead of all tokens `model(input_batch)` in the `calc_loss_batch` function" ] }, { "cell_type": "code", "execution_count": null, "id": "2f1e9547-806c-41a9-8aba-3b2822baabe4", "metadata": { "id": "2f1e9547-806c-41a9-8aba-3b2822baabe4" }, "outputs": [], "source": [ "def calc_loss_batch(input_batch, target_batch, model, device):\n", " input_batch, target_batch = input_batch.to(device), target_batch.to(device)\n", " logits = model(input_batch)[:, -1, :] # Logits of last output token\n", " loss = torch.nn.functional.cross_entropy(logits, target_batch)\n", " return loss" ] }, { "cell_type": "code", "execution_count": null, "id": "b7b83e10-5720-45e7-ac5e-369417ca846b", "metadata": { "id": "b7b83e10-5720-45e7-ac5e-369417ca846b" }, "outputs": 
[], "source": [ "def calc_loss_loader(data_loader, model, device, num_batches=None):\n", " total_loss = 0.\n", " if len(data_loader) == 0:\n", " return float(\"nan\")\n", " elif num_batches is None:\n", " num_batches = len(data_loader)\n", " else:\n", " # Reduce the number of batches to match the total number of batches in the data loader\n", " # if num_batches exceeds the number of batches in the data loader\n", " num_batches = min(num_batches, len(data_loader))\n", " for i, (input_batch, target_batch) in enumerate(data_loader):\n", " if i < num_batches:\n", " loss = calc_loss_batch(input_batch, target_batch, model, device)\n", " total_loss += loss.item()\n", " else:\n", " break\n", " return total_loss / num_batches" ] }, { "cell_type": "markdown", "id": "56826ecd-6e74-40e6-b772-d3541e585067", "metadata": { "id": "56826ecd-6e74-40e6-b772-d3541e585067" }, "source": [ "- Using the `calc_closs_loader`, we compute the initial training, validation, and test set losses before we start training" ] }, { "cell_type": "code", "execution_count": null, "id": "f6f00e53-5beb-4e64-b147-f26fd481c6ff", "metadata": { "id": "f6f00e53-5beb-4e64-b147-f26fd481c6ff" }, "outputs": [], "source": [ "with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet\n", " train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)\n", " val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)\n", " test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)\n", "\n", "print(f\"Training loss: {train_loss:.3f}\")\n", "print(f\"Validation loss: {val_loss:.3f}\")\n", "print(f\"Test loss: {test_loss:.3f}\")" ] }, { "cell_type": "markdown", "id": "e04b980b-e583-4f62-84a0-4edafaf99d5d", "metadata": { "id": "e04b980b-e583-4f62-84a0-4edafaf99d5d" }, "source": [ "- In the next section, we train the model to improve the loss values and consequently the classification accuracy" ] }, { "cell_type": "markdown", "id": "456ae0fd-6261-42b4-ab6a-d24289953083", "metadata": { "id": "456ae0fd-6261-42b4-ab6a-d24289953083" }, "source": [ "## Finetuning the model on supervised data" ] }, { "cell_type": "markdown", "id": "6a9b099b-0829-4f72-8a2b-4363e3497026", "metadata": { "id": "6a9b099b-0829-4f72-8a2b-4363e3497026" }, "source": [ "- In this section, we define and use the training function to improve the classification accuracy of the model\n", "- The `train_classifier_simple` function below is practically the same as the `train_model_simple` function we used for pretraining the model in the gpt2.ipynb notebook\n", "- The only two differences are that we now\n", " 1. track the number of training examples seen (`examples_seen`) instead of the number of tokens seen\n", " 2. 
calculate the accuracy after each epoch instead of printing a sample text after each epoch" ] }, { "cell_type": "markdown", "id": "979b6222-1dc2-4530-9d01-b6b04fe3de12", "metadata": { "id": "979b6222-1dc2-4530-9d01-b6b04fe3de12" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "Csbr60to50FL", "metadata": { "id": "Csbr60to50FL" }, "outputs": [], "source": [ "def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs,\n", " eval_freq, eval_iter):\n", " # Initialize lists to track losses and examples seen\n", " train_losses, val_losses, train_accs, val_accs = [], [], [], []\n", " examples_seen, global_step = 0, -1\n", "\n", " # Main training loop\n", " for epoch in range(num_epochs):\n", " model.train() # Set model to training mode\n", "\n", " for input_batch, target_batch in train_loader:\n", " optimizer.zero_grad() # Reset loss gradients from previous batch iteration\n", " loss = calc_loss_batch(input_batch, target_batch, model, device)\n", " loss.backward() # Calculate loss gradients\n", " optimizer.step() # Update model weights using loss gradients\n", " examples_seen += input_batch.shape[0] # New: track examples instead of tokens\n", " global_step += 1\n", "\n", " # Optional evaluation step\n", " if global_step % eval_freq == 0:\n", " train_loss, val_loss = evaluate_model(\n", " model, train_loader, val_loader, device, eval_iter)\n", " train_losses.append(train_loss)\n", " val_losses.append(val_loss)\n", " print(f\"Ep {epoch+1} (Step {global_step:06d}): \"\n", " f\"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}\")\n", "\n", " # Calculate accuracy after each epoch\n", " train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter)\n", " val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)\n", " print(f\"Training accuracy: {train_accuracy*100:.2f}% | \", end=\"\")\n", " print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", " train_accs.append(train_accuracy)\n", " val_accs.append(val_accuracy)\n", "\n", " return train_losses, val_losses, train_accs, val_accs, examples_seen" ] }, { "cell_type": "markdown", "id": "9624cb30-3e3a-45be-b006-c00475b58ae8", "metadata": { "id": "9624cb30-3e3a-45be-b006-c00475b58ae8" }, "source": [ "- The `evaluate_model` function used in the `train_classifier_simple` is the same as the one we used in the gpt2.ipynb notebook" ] }, { "cell_type": "code", "execution_count": null, "id": "bcc7bc04-6aa6-4516-a147-460e2f466eab", "metadata": { "id": "bcc7bc04-6aa6-4516-a147-460e2f466eab" }, "outputs": [], "source": [ "def evaluate_model(model, train_loader, val_loader, device, eval_iter):\n", " model.eval()\n", " with torch.no_grad():\n", " train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)\n", " val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)\n", " model.train()\n", " return train_loss, val_loss" ] }, { "cell_type": "code", "execution_count": null, "id": "X7kU3aAj7vTJ", "metadata": { "id": "X7kU3aAj7vTJ" }, "outputs": [], "source": [ "import time\n", "\n", "start_time = time.time()\n", "\n", "torch.manual_seed(123)\n", "\n", "optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)\n", "\n", "num_epochs = 5\n", "train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(\n", " model, train_loader, val_loader, optimizer, device,\n", " num_epochs=num_epochs, eval_freq=50, eval_iter=5,\n", ")\n", "\n", "end_time = time.time()\n", 
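"# Report the wall-clock training time; the exact duration depends on the hardware\n", "# (e.g., whether a CUDA GPU is available)\n",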
"execution_time_minutes = (end_time - start_time) / 60\n", "print(f\"Training completed in {execution_time_minutes:.2f} minutes.\")" ] }, { "cell_type": "markdown", "id": "1261bf90-3ce7-4591-895a-044a05538f30", "metadata": { "id": "1261bf90-3ce7-4591-895a-044a05538f30" }, "source": [ "- We use matplotlib to plot the loss function for the training and validation set" ] }, { "cell_type": "code", "execution_count": null, "id": "cURgnDqdCeka", "metadata": { "id": "cURgnDqdCeka" }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_values(epochs_seen, examples_seen, train_values, val_values, label=\"loss\"):\n", " fig, ax1 = plt.subplots(figsize=(5, 3))\n", "\n", " # Plot training and validation loss against epochs\n", " ax1.plot(epochs_seen, train_values, label=f\"Training {label}\")\n", " ax1.plot(epochs_seen, val_values, linestyle=\"-.\", label=f\"Validation {label}\")\n", " ax1.set_xlabel(\"Epochs\")\n", " ax1.set_ylabel(label.capitalize())\n", " ax1.legend()\n", "\n", " # Create a second x-axis for examples seen\n", " ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis\n", " ax2.plot(examples_seen, train_values, alpha=0) # Invisible plot for aligning ticks\n", " ax2.set_xlabel(\"Examples seen\")\n", "\n", " fig.tight_layout() # Adjust layout to make room\n", " plt.savefig(f\"{label}-plot.pdf\")\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "OIqRt466DiGk", "metadata": { "id": "OIqRt466DiGk" }, "outputs": [], "source": [ "epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))\n", "examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))\n", "\n", "plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)" ] }, { "cell_type": "markdown", "id": "dbd28174-1836-44ba-b6c0-7e0be774fadc", "metadata": { "id": "dbd28174-1836-44ba-b6c0-7e0be774fadc" }, "source": [ "- Above, based on the downward slope, we see that the model learns well\n", "- Furthermore, the fact that the training and validation loss are very close indicates that the model does not tend to overfit the training data\n", "- Similarly, we can plot the accuracy below" ] }, { "cell_type": "code", "execution_count": null, "id": "yz8BIsaF0TUo", "metadata": { "id": "yz8BIsaF0TUo" }, "outputs": [], "source": [ "epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))\n", "examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))\n", "\n", "plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label=\"accuracy\")" ] }, { "cell_type": "markdown", "id": "90aba699-21bc-42de-a69c-99f370bb0363", "metadata": { "id": "90aba699-21bc-42de-a69c-99f370bb0363" }, "source": [ "- Based on the accuracy plot above, we can see that the model achieves a relatively high training and validation accuracy after epochs 4 and 5\n", "- However, we have to keep in mind that we specified `eval_iter=5` in the training function earlier, which means that we only estimated the training and validation set performances\n", "- We can compute the training, validation, and test set performances over the complete dataset as follows below" ] }, { "cell_type": "code", "execution_count": null, "id": "UHWaJFrjY0zW", "metadata": { "id": "UHWaJFrjY0zW" }, "outputs": [], "source": [ "train_accuracy = calc_accuracy_loader(train_loader, model, device)\n", "val_accuracy = calc_accuracy_loader(val_loader, model, device)\n", "test_accuracy = calc_accuracy_loader(test_loader, model, device)\n", "\n", "print(f\"Training accuracy: 
{ "cell_type": "markdown", "id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0", "metadata": { "id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0" }, "source": [ "## Using the LLM as a spam classifier" ] }, { "cell_type": "markdown", "id": "72ebcfa2-479e-408b-9cf0-7421f6144855", "metadata": { "id": "72ebcfa2-479e-408b-9cf0-7421f6144855" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "fd5408e6-83e4-4e5a-8503-c2fba6073f31", "metadata": { "id": "fd5408e6-83e4-4e5a-8503-c2fba6073f31" }, "source": [ "- Finally, let's see the finetuned GPT model in action\n", "- The `classify_review` function below implements data preprocessing steps similar to those in the `SpamDataset` we implemented earlier\n", "- Then, the function converts the model's predicted integer class label into the corresponding class name and returns it" ] }, { "cell_type": "code", "execution_count": null, "id": "aHdn6xvL-IW5", "metadata": { "id": "aHdn6xvL-IW5" }, "outputs": [], "source": [ "def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):\n", " model.eval()\n", "\n", " # Prepare inputs to the model\n", " input_ids = tokenizer.encode(text)\n", " supported_context_length = model.pos_emb.weight.shape[0]\n", " # Note: In the book, this was originally written as pos_emb.weight.shape[1] by mistake\n", " # It didn't break the code but would have caused unnecessary truncation (to 768 instead of 1024)\n", "\n", " # Truncate sequences if they are too long\n", " input_ids = input_ids[:min(max_length, supported_context_length)]\n", "\n", " # Pad sequences to the longest sequence\n", " input_ids += [pad_token_id] * (max_length - len(input_ids))\n", " input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0) # add batch dimension\n", "\n", " # Model inference\n", " with torch.no_grad():\n", " logits = model(input_tensor)[:, -1, :] # Logits of the last output token\n", " predicted_label = torch.argmax(logits, dim=-1).item()\n", "\n", " # Return the classified result\n", " return \"spam\" if predicted_label == 1 else \"not spam\"" ] }, { "cell_type": "markdown", "id": "f29682d8-a899-4d9b-b973-f8d5ec68172c", "metadata": { "id": "f29682d8-a899-4d9b-b973-f8d5ec68172c" }, "source": [ "- Let's try it out on a few examples below" ] }, { "cell_type": "code", "execution_count": null, "id": "apU_pf51AWSV", "metadata": { "id": "apU_pf51AWSV" }, "outputs": [], "source": [ "text_1 = (\n", " \"You are a winner you have been specially\"\n", " \" selected to receive $1000 cash or a $2000 award.\"\n", ")\n", "\n", "print(classify_review(\n", " text_1, model, tokenizer, device, max_length=train_dataset.max_length\n", "))" ] }, { "cell_type": "code", "execution_count": null, "id": "1g5VTOo_Ajs5",
"metadata": { "id": "1g5VTOo_Ajs5" }, "outputs": [], "source": [ "text_2 = (\n", " \"Hey, just wanted to check if we're still on\"\n", " \" for dinner tonight? Let me know!\"\n", ")\n", "\n", "print(classify_review(\n", " text_2, model, tokenizer, device, max_length=train_dataset.max_length\n", "))" ] }, { "cell_type": "markdown", "id": "bf736e39-0d47-40c1-8d18-1f716cf7a81e", "metadata": { "id": "bf736e39-0d47-40c1-8d18-1f716cf7a81e" }, "source": [ "- Finally, let's save the model in case we want to reuse the model later without having to train it again" ] }, { "cell_type": "code", "execution_count": null, "id": "mYnX-gI1CfQY", "metadata": { "id": "mYnX-gI1CfQY" }, "outputs": [], "source": [ "torch.save(model.state_dict(), \"review_classifier.pth\")" ] }, { "cell_type": "markdown", "id": "ba78cf7c-6b80-4f71-a50e-3ccc73839af6", "metadata": { "id": "ba78cf7c-6b80-4f71-a50e-3ccc73839af6" }, "source": [ "- Then, in a new session, we could load the model as follows" ] }, { "cell_type": "code", "execution_count": null, "id": "cc4e68a5-d492-493b-87ef-45c475f353f5", "metadata": { "id": "cc4e68a5-d492-493b-87ef-45c475f353f5" }, "outputs": [], "source": [ "model_state_dict = torch.load(\"review_classifier.pth\", map_location=device, weights_only=True)\n", "model.load_state_dict(model_state_dict)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }