{ "cells": [ { "cell_type": "markdown", "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b", "metadata": { "id": "c024bfa4-1a7a-4751-b5a1-827225a3478b" }, "source": [ "\n", "\n", "\n", "
This notebook is an adapted version of https://github.com/rasbt/LLMs-from-scratch\n", "
\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "id": "bfabadb8-5935-45ff-b39c-db7a29012129", "metadata": { "id": "bfabadb8-5935-45ff-b39c-db7a29012129" }, "source": [ "# Finetuning for Text Classification" ] }, { "cell_type": "code", "source": [ "!pip install tiktoken" ], "metadata": { "id": "9ogchkElyO0h" }, "id": "9ogchkElyO0h", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90", "metadata": { "id": "5b7e01c2-1c84-4f2a-bb51-2e0b74abda90" }, "outputs": [], "source": [ "from importlib.metadata import version\n", "\n", "pkgs = [\"matplotlib\",\n", " \"numpy\",\n", " \"tiktoken\",\n", " \"torch\",\n", " \"tensorflow\", # For OpenAI's pretrained weights\n", " \"pandas\" # Dataset loading\n", " ]\n", "for p in pkgs:\n", " print(f\"{p} version: {version(p)}\")" ] }, { "cell_type": "markdown", "id": "a445828a-ff10-4efa-9f60-a2e2aed4c87d", "metadata": { "id": "a445828a-ff10-4efa-9f60-a2e2aed4c87d" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "946c3e56-b04b-4b0f-b35f-b485ce5b28df", "metadata": { "id": "946c3e56-b04b-4b0f-b35f-b485ce5b28df" }, "outputs": [], "source": [ "# Utility to prevent certain cells from being executed twice\n", "\n", "from IPython.core.magic import register_line_cell_magic\n", "\n", "executed_cells = set()\n", "\n", "@register_line_cell_magic\n", "def run_once(line, cell):\n", " if line not in executed_cells:\n", " get_ipython().run_cell(cell)\n", " executed_cells.add(line)\n", " else:\n", " print(f\"Cell '{line}' has already been executed.\")" ] }, { "cell_type": "markdown", "id": "3a84cf35-b37f-4c15-8972-dfafc9fadc1c", "metadata": { "id": "3a84cf35-b37f-4c15-8972-dfafc9fadc1c" }, "source": [ "## Classification finetuning" ] }, { "cell_type": "markdown", "id": "a7f60321-95b8-46a9-97bf-1d07fda2c3dd", "metadata": { "id": "a7f60321-95b8-46a9-97bf-1d07fda2c3dd" }, "source": [ "- Large Language Models start with general knowledge from training on vast amounts of data. 
They are not specialized in specific tasks\n", "- Finetuning is like giving specialized training to the above Large Language Model\n", "- Classification finetuning teaches a model to sort inputs into specific categories\n", "- In classification finetuning, we have a specific number of class labels (for example, \"spam\" and \"not spam\") that the model can output" ] }, { "cell_type": "markdown", "id": "0b37a0c4-0bb1-4061-b1fe-eaa4416d52c3", "metadata": { "id": "0b37a0c4-0bb1-4061-b1fe-eaa4416d52c3" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf", "metadata": { "id": "8c7017a2-32aa-4002-a2f3-12aac293ccdf" }, "source": [ "## Preparing the dataset" ] }, { "cell_type": "markdown", "id": "5f628975-d2e8-4f7f-ab38-92bb868b7067", "metadata": { "id": "5f628975-d2e8-4f7f-ab38-92bb868b7067" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "9fbd459f-63fa-4d8c-8499-e23103156c7d", "metadata": { "id": "9fbd459f-63fa-4d8c-8499-e23103156c7d" }, "source": [ "- This section prepares the dataset we use for classification finetuning\n", "- We use a dataset consisting of spam and non-spam text messages to finetune the LLM to classify them\n", "- First, we download and unzip the dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "def7c09b-af9c-4216-90ce-5e67aed1065c", "metadata": { "id": "def7c09b-af9c-4216-90ce-5e67aed1065c" }, "outputs": [], "source": [ "import urllib.request\n", "import zipfile\n", "import os\n", "from pathlib import Path\n", "\n", "url = \"https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip\"\n", "zip_path = \"sms_spam_collection.zip\"\n", "extracted_path = \"sms_spam_collection\"\n", "data_file_path = Path(extracted_path) / \"SMSSpamCollection.tsv\"\n", "\n", "def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):\n", " if data_file_path.exists():\n", " print(f\"{data_file_path} already exists. 
Skipping download and extraction.\")\n", " return\n", "\n", " # Downloading the file\n", " with urllib.request.urlopen(url) as response:\n", " with open(zip_path, \"wb\") as out_file:\n", " out_file.write(response.read())\n", "\n", " # Unzipping the file\n", " with zipfile.ZipFile(zip_path, \"r\") as zip_ref:\n", " zip_ref.extractall(extracted_path)\n", "\n", " # Add .tsv file extension\n", " original_file_path = Path(extracted_path) / \"SMSSpamCollection\"\n", " os.rename(original_file_path, data_file_path)\n", " print(f\"File downloaded and saved as {data_file_path}\")\n", "\n", "download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)" ] }, { "cell_type": "markdown", "id": "6aac2d19-06d0-4005-916b-0bd4b1ee50d1", "metadata": { "id": "6aac2d19-06d0-4005-916b-0bd4b1ee50d1" }, "source": [ "- The dataset is saved as a tab-separated text file, which we can load into a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "da0ed4da-ac31-4e4d-8bdd-2153be4656a4", "metadata": { "id": "da0ed4da-ac31-4e4d-8bdd-2153be4656a4" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(data_file_path, sep=\"\\t\", header=None, names=[\"Label\", \"Text\"])\n", "df" ] }, { "cell_type": "markdown", "id": "e7b6e631-4f0b-4aab-82b9-8898e6663109", "metadata": { "id": "e7b6e631-4f0b-4aab-82b9-8898e6663109" }, "source": [ "- When we check the class distribution, we see that the data contains \"ham\" (i.e., \"not spam\") much more frequently than \"spam\"" ] }, { "cell_type": "code", "execution_count": null, "id": "495a5280-9d7c-41d4-9719-64ab99056d4c", "metadata": { "id": "495a5280-9d7c-41d4-9719-64ab99056d4c" }, "outputs": [], "source": [ "print(df[\"Label\"].value_counts())" ] }, { "cell_type": "markdown", "id": "f773f054-0bdc-4aad-bbf6-397621bf63db", "metadata": { "id": "f773f054-0bdc-4aad-bbf6-397621bf63db" }, "source": [ "- For simplicity, and because we prefer a small dataset for educational purposes anyway (it will make it possible to finetune the LLM faster), we subsample (undersample) the dataset so that it contains 747 instances from each class\n", "- (Next to undersampling, there are several other ways to deal with class balances, but they are out of the scope of a book on LLMs; you can find examples and more information in the [`imbalanced-learn` user guide](https://imbalanced-learn.org/stable/user_guide.html))" ] }, { "cell_type": "code", "execution_count": null, "id": "7be4a0a2-9704-4a96-b38f-240339818688", "metadata": { "id": "7be4a0a2-9704-4a96-b38f-240339818688" }, "outputs": [], "source": [ "%%run_once balance_df\n", "\n", "\n", "def create_balanced_dataset(df):\n", "\n", " # Count the instances of \"spam\"\n", " num_spam = df[df[\"Label\"] == \"spam\"].shape[0]\n", "\n", " # Randomly sample \"ham\" instances to match the number of \"spam\" instances\n", " ham_subset = df[df[\"Label\"] == \"ham\"].sample(num_spam, random_state=123)\n", "\n", " # Combine ham \"subset\" with \"spam\"\n", " balanced_df = pd.concat([ham_subset, df[df[\"Label\"] == \"spam\"]])\n", "\n", " return balanced_df\n", "\n", "\n", "balanced_df = create_balanced_dataset(df)\n", "print(balanced_df[\"Label\"].value_counts())" ] }, { "cell_type": "markdown", "id": "d3fd2f5a-06d8-4d30-a2e3-230b86c559d6", "metadata": { "id": "d3fd2f5a-06d8-4d30-a2e3-230b86c559d6" }, "source": [ "- Next, we change the string class labels \"ham\" and \"spam\" into integer class labels 0 and 1:" ] }, { "cell_type": "code", "execution_count": null, "id": 
"c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd", "metadata": { "id": "c1b10c3d-5d57-42d0-8de8-cf80a06f5ffd" }, "outputs": [], "source": [ "%%run_once label_mapping\n", "balanced_df[\"Label\"] = balanced_df[\"Label\"].map({\"ham\": 0, \"spam\": 1})" ] }, { "cell_type": "code", "execution_count": null, "id": "e6f7f062-ef4e-4020-8275-71990cab4414", "metadata": { "id": "e6f7f062-ef4e-4020-8275-71990cab4414" }, "outputs": [], "source": [ "balanced_df" ] }, { "cell_type": "markdown", "id": "5715e685-35b4-4b45-a86c-8a8694de9d6f", "metadata": { "id": "5715e685-35b4-4b45-a86c-8a8694de9d6f" }, "source": [ "- Let's now define a function that randomly divides the dataset into training, validation, and test subsets" ] }, { "cell_type": "code", "execution_count": null, "id": "uQl0Psdmx15D", "metadata": { "id": "uQl0Psdmx15D" }, "outputs": [], "source": [ "def random_split(df, train_frac, validation_frac):\n", " # Shuffle the entire DataFrame\n", " df = df.sample(frac=1, random_state=123).reset_index(drop=True)\n", "\n", " # Calculate split indices\n", " train_end = int(len(df) * train_frac)\n", " validation_end = train_end + int(len(df) * validation_frac)\n", "\n", " # Split the DataFrame\n", " train_df = df[:train_end]\n", " validation_df = df[train_end:validation_end]\n", " test_df = df[validation_end:]\n", "\n", " return train_df, validation_df, test_df\n", "\n", "train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)\n", "# Test size is implied to be 0.2 as the remainder\n", "\n", "train_df.to_csv(\"train.csv\", index=None)\n", "validation_df.to_csv(\"validation.csv\", index=None)\n", "test_df.to_csv(\"test.csv\", index=None)" ] }, { "cell_type": "markdown", "id": "a8d7a0c5-1d5f-458a-b685-3f49520b0094", "metadata": { "id": "a8d7a0c5-1d5f-458a-b685-3f49520b0094" }, "source": [ "## Creating data loaders" ] }, { "cell_type": "markdown", "id": "7126108a-75e7-4862-b0fb-cbf59a18bb6c", "metadata": { "id": "7126108a-75e7-4862-b0fb-cbf59a18bb6c" }, "source": [ "- Note that the text messages have different lengths; if we want to combine multiple training examples in a batch, we have to either\n", " 1. truncate all messages to the length of the shortest message in the dataset or batch\n", " 2. 
pad all messages to the length of the longest message in the dataset or batch\n", "\n", "- We choose option 2 and pad all messages to the longest message in the dataset\n", "- For that, we use `<|endoftext|>` as a padding token" ] }, { "cell_type": "markdown", "id": "0829f33f-1428-4f22-9886-7fee633b3666", "metadata": { "id": "0829f33f-1428-4f22-9886-7fee633b3666" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7", "metadata": { "id": "74c3c463-8763-4cc0-9320-41c7eaad8ab7" }, "outputs": [], "source": [ "import tiktoken\n", "\n", "tokenizer = tiktoken.get_encoding(\"gpt2\")\n", "print(tokenizer.encode(\"<|endoftext|>\", allowed_special={\"<|endoftext|>\"}))" ] }, { "cell_type": "markdown", "id": "04f582ff-68bf-450e-bd87-5fb61afe431c", "metadata": { "id": "04f582ff-68bf-450e-bd87-5fb61afe431c" }, "source": [ "- The `SpamDataset` class below identifies the longest sequence in the training dataset and adds the padding token to the others to match that sequence length" ] }, { "cell_type": "code", "execution_count": null, "id": "d7791b52-af18-4ac4-afa9-b921068e383e", "metadata": { "id": "d7791b52-af18-4ac4-afa9-b921068e383e" }, "outputs": [], "source": [ "import torch\n", "from torch.utils.data import Dataset\n", "\n", "\n", "class SpamDataset(Dataset):\n", " def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):\n", " self.data = pd.read_csv(csv_file)\n", "\n", " # Pre-tokenize texts\n", " self.encoded_texts = [\n", " tokenizer.encode(text) for text in self.data[\"Text\"]\n", " ]\n", "\n", " if max_length is None:\n", " self.max_length = self._longest_encoded_length()\n", " else:\n", " self.max_length = max_length\n", " # Truncate sequences if they are longer than max_length\n", " self.encoded_texts = [\n", " encoded_text[:self.max_length]\n", " for encoded_text in self.encoded_texts\n", " ]\n", "\n", " # Pad sequences to the longest sequence\n", " self.encoded_texts = [\n", " encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))\n", " for encoded_text in self.encoded_texts\n", " ]\n", "\n", " def __getitem__(self, index):\n", " encoded = self.encoded_texts[index]\n", " label = self.data.iloc[index][\"Label\"]\n", " return (\n", " torch.tensor(encoded, dtype=torch.long),\n", " torch.tensor(label, dtype=torch.long)\n", " )\n", "\n", " def __len__(self):\n", " return len(self.data)\n", "\n", " def _longest_encoded_length(self):\n", " max_length = 0\n", " for encoded_text in self.encoded_texts:\n", " encoded_length = len(encoded_text)\n", " if encoded_length > max_length:\n", " max_length = encoded_length\n", " return max_length" ] }, { "cell_type": "code", "execution_count": null, "id": "uzj85f8ou82h", "metadata": { "id": "uzj85f8ou82h" }, "outputs": [], "source": [ "train_dataset = SpamDataset(\n", " csv_file=\"train.csv\",\n", " max_length=None,\n", " tokenizer=tokenizer\n", ")\n", "\n", "print(train_dataset.max_length)" ] }, { "cell_type": "markdown", "id": "15bdd932-97eb-4b88-9cf9-d766ea4c3a60", "metadata": { "id": "15bdd932-97eb-4b88-9cf9-d766ea4c3a60" }, "source": [ "- We also pad the validation and test set to the longest training sequence\n", "- Note that validation and test set samples that are longer than the longest training example are being truncated via `encoded_text[:self.max_length]` in the `SpamDataset` code\n", "- This behavior is entirely optional, and it would also work well if we set `max_length=None` in both the validation and test set cases" ] }, { "cell_type": "code", 
"execution_count": null, "id": "bb0c502d-a75e-4248-8ea0-196e2b00c61e", "metadata": { "id": "bb0c502d-a75e-4248-8ea0-196e2b00c61e" }, "outputs": [], "source": [ "val_dataset = SpamDataset(\n", " csv_file=\"validation.csv\",\n", " max_length=train_dataset.max_length,\n", " tokenizer=tokenizer\n", ")\n", "test_dataset = SpamDataset(\n", " csv_file=\"test.csv\",\n", " max_length=train_dataset.max_length,\n", " tokenizer=tokenizer\n", ")" ] }, { "cell_type": "markdown", "id": "20170d89-85a0-4844-9887-832f5d23432a", "metadata": { "id": "20170d89-85a0-4844-9887-832f5d23432a" }, "source": [ "- Next, we use the dataset to instantiate the data loaders, which is similar to creating the data loaders" ] }, { "cell_type": "markdown", "id": "64bcc349-205f-48f8-9655-95ff21f5e72f", "metadata": { "id": "64bcc349-205f-48f8-9655-95ff21f5e72f" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542", "metadata": { "id": "8681adc0-6f02-4e75-b01a-a6ab75d05542" }, "outputs": [], "source": [ "from torch.utils.data import DataLoader\n", "\n", "num_workers = 0\n", "batch_size = 8\n", "\n", "torch.manual_seed(123)\n", "\n", "train_loader = DataLoader(\n", " dataset=train_dataset,\n", " batch_size=batch_size,\n", " shuffle=True,\n", " num_workers=num_workers,\n", " drop_last=True,\n", ")\n", "\n", "val_loader = DataLoader(\n", " dataset=val_dataset,\n", " batch_size=batch_size,\n", " num_workers=num_workers,\n", " drop_last=False,\n", ")\n", "\n", "test_loader = DataLoader(\n", " dataset=test_dataset,\n", " batch_size=batch_size,\n", " num_workers=num_workers,\n", " drop_last=False,\n", ")" ] }, { "cell_type": "markdown", "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57", "metadata": { "id": "ab7335db-e0bb-4e27-80c5-eea11e593a57" }, "source": [ "- As a verification step, we iterate through the data loaders and ensure that the batches contain 8 training examples each, where each training example consists of 120 tokens" ] }, { "cell_type": "code", "execution_count": null, "id": "4dee6882-4c3a-4964-af15-fa31f86ad047", "metadata": { "id": "4dee6882-4c3a-4964-af15-fa31f86ad047" }, "outputs": [], "source": [ "print(\"Train loader:\")\n", "for input_batch, target_batch in train_loader:\n", " pass\n", "\n", "print(\"Input batch dimensions:\", input_batch.shape)\n", "print(\"Label batch dimensions\", target_batch.shape)" ] }, { "cell_type": "markdown", "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1", "metadata": { "id": "5cdd7947-7039-49bf-8a5e-c0a2f4281ca1" }, "source": [ "- Lastly, let's print the total number of batches in each dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "IZfw-TYD2zTj", "metadata": { "id": "IZfw-TYD2zTj" }, "outputs": [], "source": [ "print(f\"{len(train_loader)} training batches\")\n", "print(f\"{len(val_loader)} validation batches\")\n", "print(f\"{len(test_loader)} test batches\")" ] }, { "cell_type": "markdown", "id": "d1c4f61a-5f5d-4b3b-97cf-151b617d1d6c", "metadata": { "id": "d1c4f61a-5f5d-4b3b-97cf-151b617d1d6c" }, "source": [ "## Initializing a model with pretrained weights" ] }, { "cell_type": "markdown", "id": "97e1af8b-8bd1-4b44-8b8b-dc031496e208", "metadata": { "id": "97e1af8b-8bd1-4b44-8b8b-dc031496e208" }, "source": [ "- In this section, we initialize the pretrained model\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "2992d779-f9fb-4812-a117-553eb790a5a9", "metadata": { "id": "2992d779-f9fb-4812-a117-553eb790a5a9" }, "outputs": [], "source": [ "CHOOSE_MODEL = \"gpt2-small (124M)\"\n", 
"INPUT_PROMPT = \"Every effort moves\"\n", "\n", "BASE_CONFIG = {\n", " \"vocab_size\": 50257, # Vocabulary size\n", " \"context_length\": 1024, # Context length\n", " \"drop_rate\": 0.0, # Dropout rate\n", " \"qkv_bias\": True # Query-key-value bias\n", "}\n", "\n", "model_configs = {\n", " \"gpt2-small (124M)\": {\"emb_dim\": 768, \"n_layers\": 12, \"n_heads\": 12},\n", " \"gpt2-medium (355M)\": {\"emb_dim\": 1024, \"n_layers\": 24, \"n_heads\": 16},\n", " \"gpt2-large (774M)\": {\"emb_dim\": 1280, \"n_layers\": 36, \"n_heads\": 20},\n", " \"gpt2-xl (1558M)\": {\"emb_dim\": 1600, \"n_layers\": 48, \"n_heads\": 25},\n", "}\n", "\n", "BASE_CONFIG.update(model_configs[CHOOSE_MODEL])\n", "\n", "assert train_dataset.max_length <= BASE_CONFIG[\"context_length\"], (\n", " f\"Dataset length {train_dataset.max_length} exceeds model's context \"\n", " f\"length {BASE_CONFIG['context_length']}. Reinitialize data sets with \"\n", " f\"`max_length={BASE_CONFIG['context_length']}`\"\n", ")" ] }, { "cell_type": "code", "source": [ "def download_and_load_gpt2(model_size, models_dir):\n", " # Validate model size\n", " allowed_sizes = (\"124M\", \"355M\", \"774M\", \"1558M\")\n", " if model_size not in allowed_sizes:\n", " raise ValueError(f\"Model size not in {allowed_sizes}\")\n", "\n", " # Define paths\n", " model_dir = os.path.join(models_dir, model_size)\n", " base_url = \"https://openaipublic.blob.core.windows.net/gpt-2/models\"\n", " filenames = [\n", " \"checkpoint\", \"encoder.json\", \"hparams.json\",\n", " \"model.ckpt.data-00000-of-00001\", \"model.ckpt.index\",\n", " \"model.ckpt.meta\", \"vocab.bpe\"\n", " ]\n", "\n", " # Download files\n", " os.makedirs(model_dir, exist_ok=True)\n", " for filename in filenames:\n", " file_url = os.path.join(base_url, model_size, filename)\n", " file_path = os.path.join(model_dir, filename)\n", " download_file(file_url, file_path)\n", "\n", " # Load settings and params\n", " tf_ckpt_path = tf.train.latest_checkpoint(model_dir)\n", " settings = json.load(open(os.path.join(model_dir, \"hparams.json\")))\n", " params = load_gpt2_params_from_tf_ckpt(tf_ckpt_path, settings)\n", "\n", " return settings, params\n", "\n", "\n", "def download_file(url, destination):\n", " # Send a GET request to download the file\n", "\n", " try:\n", " with urllib.request.urlopen(url) as response:\n", " # Get the total file size from headers, defaulting to 0 if not present\n", " file_size = int(response.headers.get(\"Content-Length\", 0))\n", "\n", " # Check if file exists and has the same size\n", " if os.path.exists(destination):\n", " file_size_local = os.path.getsize(destination)\n", " if file_size == file_size_local:\n", " print(f\"File already exists and is up-to-date: {destination}\")\n", " return\n", "\n", " # Define the block size for reading the file\n", " block_size = 1024 # 1 Kilobyte\n", "\n", " # Initialize the progress bar with total file size\n", " progress_bar_description = os.path.basename(url) # Extract filename from URL\n", " with tqdm(total=file_size, unit=\"iB\", unit_scale=True, desc=progress_bar_description) as progress_bar:\n", " # Open the destination file in binary write mode\n", " with open(destination, \"wb\") as file:\n", " # Read the file in chunks and write to destination\n", " while True:\n", " chunk = response.read(block_size)\n", " if not chunk:\n", " break\n", " file.write(chunk)\n", " progress_bar.update(len(chunk)) # Update progress bar\n", " except urllib.error.HTTPError:\n", " s = (\n", " f\"The specified URL ({url}) is incorrect, the internet 
connection cannot be established,\"\n", " \"\\nor the requested file is temporarily unavailable.\\nPlease visit the following website\"\n", " \" for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273\")\n", " print(s)\n", "\n", "\n", "def load_gpt2_params_from_tf_ckpt(ckpt_path, settings):\n", " # Initialize parameters dictionary with empty blocks for each layer\n", " params = {\"blocks\": [{} for _ in range(settings[\"n_layer\"])]}\n", "\n", " # Iterate over each variable in the checkpoint\n", " for name, _ in tf.train.list_variables(ckpt_path):\n", " # Load the variable and remove singleton dimensions\n", " variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))\n", "\n", " # Process the variable name to extract relevant parts\n", " variable_name_parts = name.split(\"/\")[1:] # Skip the 'model/' prefix\n", "\n", " # Identify the target dictionary for the variable\n", " target_dict = params\n", " if variable_name_parts[0].startswith(\"h\"):\n", " layer_number = int(variable_name_parts[0][1:])\n", " target_dict = params[\"blocks\"][layer_number]\n", "\n", " # Recursively access or create nested dictionaries\n", " for key in variable_name_parts[1:-1]:\n", " target_dict = target_dict.setdefault(key, {})\n", "\n", " # Assign the variable array to the last key\n", " last_key = variable_name_parts[-1]\n", " target_dict[last_key] = variable_array\n", "\n", " return params" ], "metadata": { "id": "c9fj2xCc04Ih" }, "id": "c9fj2xCc04Ih", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import torch.nn as nn\n", "from tqdm import tqdm\n", "import tensorflow as tf\n", "import json\n", "import numpy as np" ], "metadata": { "id": "DQS0UDx41BMv" }, "id": "DQS0UDx41BMv", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Building the GPT model\n", "\n", "(same model as the one in gpt2.ipynb)" ], "metadata": { "id": "9usC7F_iALtQ" }, "id": "9usC7F_iALtQ" }, { "cell_type": "code", "source": [ "class MultiHeadAttention(nn.Module):\n", " def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):\n", " super().__init__()\n", " assert d_out % num_heads == 0, \"d_out must be divisible by num_heads\"\n", "\n", " self.d_out = d_out\n", " self.num_heads = num_heads\n", " self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim\n", "\n", " self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)\n", " self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs\n", " self.dropout = nn.Dropout(dropout)\n", " self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))\n", "\n", " def forward(self, x):\n", " b, num_tokens, d_in = x.shape\n", "\n", " keys = self.W_key(x) # Shape: (b, num_tokens, d_out)\n", " queries = self.W_query(x)\n", " values = self.W_value(x)\n", "\n", " # We implicitly split the matrix by adding a `num_heads` dimension\n", " # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)\n", " keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)\n", " values = values.view(b, num_tokens, self.num_heads, self.head_dim)\n", " queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)\n", "\n", " # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)\n", " keys = keys.transpose(1, 2)\n", " queries = 
queries.transpose(1, 2)\n", " values = values.transpose(1, 2)\n", "\n", " # Compute scaled dot-product attention (aka self-attention) with a causal mask\n", " attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head\n", "\n", " # Original mask truncated to the number of tokens and converted to boolean\n", " mask_bool = self.mask.bool()[:num_tokens, :num_tokens]\n", "\n", " # Use the mask to fill attention scores\n", " attn_scores.masked_fill_(mask_bool, -torch.inf)\n", "\n", " attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)\n", " attn_weights = self.dropout(attn_weights)\n", "\n", " # Shape: (b, num_tokens, num_heads, head_dim)\n", " context_vec = (attn_weights @ values).transpose(1, 2)\n", "\n", " # Combine heads, where self.d_out = self.num_heads * self.head_dim\n", " context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)\n", " context_vec = self.out_proj(context_vec) # optional projection\n", "\n", " return context_vec\n", "\n", "class GELU(nn.Module):\n", " def __init__(self):\n", " super().__init__()\n", "\n", " def forward(self, x):\n", " return 0.5 * x * (1 + torch.tanh(\n", " torch.sqrt(torch.tensor(2.0 / torch.pi)) *\n", " (x + 0.044715 * torch.pow(x, 3))\n", " ))\n", "\n", "class FeedForward(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.layers = nn.Sequential(\n", " nn.Linear(cfg[\"emb_dim\"], 4 * cfg[\"emb_dim\"]),\n", " GELU(),\n", " nn.Linear(4 * cfg[\"emb_dim\"], cfg[\"emb_dim\"]),\n", " )\n", "\n", " def forward(self, x):\n", " return self.layers(x)\n", "\n", "class LayerNorm(nn.Module):\n", " def __init__(self, emb_dim):\n", " super().__init__()\n", " self.eps = 1e-5\n", " self.scale = nn.Parameter(torch.ones(emb_dim))\n", " self.shift = nn.Parameter(torch.zeros(emb_dim))\n", "\n", " def forward(self, x):\n", " mean = x.mean(dim=-1, keepdim=True)\n", " var = x.var(dim=-1, keepdim=True, unbiased=False)\n", " norm_x = (x - mean) / torch.sqrt(var + self.eps)\n", " return self.scale * norm_x + self.shift\n", "\n", "\n", "class TransformerBlock(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.att = MultiHeadAttention(\n", " d_in=cfg[\"emb_dim\"],\n", " d_out=cfg[\"emb_dim\"],\n", " context_length=cfg[\"context_length\"],\n", " num_heads=cfg[\"n_heads\"],\n", " dropout=cfg[\"drop_rate\"],\n", " qkv_bias=cfg[\"qkv_bias\"])\n", " self.ff = FeedForward(cfg)\n", " self.norm1 = LayerNorm(cfg[\"emb_dim\"])\n", " self.norm2 = LayerNorm(cfg[\"emb_dim\"])\n", " self.drop_shortcut = nn.Dropout(cfg[\"drop_rate\"])\n", "\n", " def forward(self, x):\n", " # Shortcut connection for attention block\n", " shortcut = x\n", " x = self.norm1(x)\n", " x = self.att(x) # Shape [batch_size, num_tokens, emb_size]\n", " x = self.drop_shortcut(x)\n", " x = x + shortcut # Add the original input back\n", "\n", " # Shortcut connection for feed forward block\n", " shortcut = x\n", " x = self.norm2(x)\n", " x = self.ff(x)\n", " x = self.drop_shortcut(x)\n", " x = x + shortcut # Add the original input back\n", "\n", " return x" ], "metadata": { "id": "oIVPE9Tg1YOz" }, "id": "oIVPE9Tg1YOz", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "class GPTModel(nn.Module):\n", " def __init__(self, cfg):\n", " super().__init__()\n", " self.tok_emb = nn.Embedding(cfg[\"vocab_size\"], cfg[\"emb_dim\"])\n", " self.pos_emb = nn.Embedding(cfg[\"context_length\"], cfg[\"emb_dim\"])\n", " self.drop_emb = nn.Dropout(cfg[\"drop_rate\"])\n", "\n", " self.trf_blocks = nn.Sequential(\n", 
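"            # Stack cfg[\"n_layers\"] identical TransformerBlock modules; nn.Sequential\n", "            # applies them one after another in a single call inside forward()\n",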
" *[TransformerBlock(cfg) for _ in range(cfg[\"n_layers\"])])\n", "\n", " self.final_norm = LayerNorm(cfg[\"emb_dim\"])\n", " self.out_head = nn.Linear(\n", " cfg[\"emb_dim\"], cfg[\"vocab_size\"], bias=False\n", " )\n", "\n", " def forward(self, in_idx):\n", " batch_size, seq_len = in_idx.shape\n", " tok_embeds = self.tok_emb(in_idx)\n", " pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))\n", " x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]\n", " x = self.drop_emb(x)\n", " x = self.trf_blocks(x)\n", " x = self.final_norm(x)\n", " logits = self.out_head(x)\n", " return logits" ], "metadata": { "id": "viShjjIF068N" }, "id": "viShjjIF068N", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "def assign(left, right):\n", " if left.shape != right.shape:\n", " raise ValueError(f\"Shape mismatch. Left: {left.shape}, Right: {right.shape}\")\n", " return torch.nn.Parameter(torch.tensor(right))\n", "\n", "def load_weights_into_gpt(gpt, params):\n", " gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])\n", " gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])\n", "\n", " for b in range(len(params[\"blocks\"])):\n", " q_w, k_w, v_w = np.split(\n", " (params[\"blocks\"][b][\"attn\"][\"c_attn\"])[\"w\"], 3, axis=-1)\n", " gpt.trf_blocks[b].att.W_query.weight = assign(\n", " gpt.trf_blocks[b].att.W_query.weight, q_w.T)\n", " gpt.trf_blocks[b].att.W_key.weight = assign(\n", " gpt.trf_blocks[b].att.W_key.weight, k_w.T)\n", " gpt.trf_blocks[b].att.W_value.weight = assign(\n", " gpt.trf_blocks[b].att.W_value.weight, v_w.T)\n", "\n", " q_b, k_b, v_b = np.split(\n", " (params[\"blocks\"][b][\"attn\"][\"c_attn\"])[\"b\"], 3, axis=-1)\n", " gpt.trf_blocks[b].att.W_query.bias = assign(\n", " gpt.trf_blocks[b].att.W_query.bias, q_b)\n", " gpt.trf_blocks[b].att.W_key.bias = assign(\n", " gpt.trf_blocks[b].att.W_key.bias, k_b)\n", " gpt.trf_blocks[b].att.W_value.bias = assign(\n", " gpt.trf_blocks[b].att.W_value.bias, v_b)\n", "\n", " gpt.trf_blocks[b].att.out_proj.weight = assign(\n", " gpt.trf_blocks[b].att.out_proj.weight,\n", " params[\"blocks\"][b][\"attn\"][\"c_proj\"][\"w\"].T)\n", " gpt.trf_blocks[b].att.out_proj.bias = assign(\n", " gpt.trf_blocks[b].att.out_proj.bias,\n", " params[\"blocks\"][b][\"attn\"][\"c_proj\"][\"b\"])\n", "\n", " gpt.trf_blocks[b].ff.layers[0].weight = assign(\n", " gpt.trf_blocks[b].ff.layers[0].weight,\n", " params[\"blocks\"][b][\"mlp\"][\"c_fc\"][\"w\"].T)\n", " gpt.trf_blocks[b].ff.layers[0].bias = assign(\n", " gpt.trf_blocks[b].ff.layers[0].bias,\n", " params[\"blocks\"][b][\"mlp\"][\"c_fc\"][\"b\"])\n", " gpt.trf_blocks[b].ff.layers[2].weight = assign(\n", " gpt.trf_blocks[b].ff.layers[2].weight,\n", " params[\"blocks\"][b][\"mlp\"][\"c_proj\"][\"w\"].T)\n", " gpt.trf_blocks[b].ff.layers[2].bias = assign(\n", " gpt.trf_blocks[b].ff.layers[2].bias,\n", " params[\"blocks\"][b][\"mlp\"][\"c_proj\"][\"b\"])\n", "\n", " gpt.trf_blocks[b].norm1.scale = assign(\n", " gpt.trf_blocks[b].norm1.scale,\n", " params[\"blocks\"][b][\"ln_1\"][\"g\"])\n", " gpt.trf_blocks[b].norm1.shift = assign(\n", " gpt.trf_blocks[b].norm1.shift,\n", " params[\"blocks\"][b][\"ln_1\"][\"b\"])\n", " gpt.trf_blocks[b].norm2.scale = assign(\n", " gpt.trf_blocks[b].norm2.scale,\n", " params[\"blocks\"][b][\"ln_2\"][\"g\"])\n", " gpt.trf_blocks[b].norm2.shift = assign(\n", " gpt.trf_blocks[b].norm2.shift,\n", " params[\"blocks\"][b][\"ln_2\"][\"b\"])\n", "\n", " gpt.final_norm.scale = assign(gpt.final_norm.scale, 
params[\"g\"])\n", " gpt.final_norm.shift = assign(gpt.final_norm.shift, params[\"b\"])\n", " gpt.out_head.weight = assign(gpt.out_head.weight, params[\"wte\"])\n" ], "metadata": { "id": "6MN5g7ah1rhB" }, "id": "6MN5g7ah1rhB", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "022a649a-44f5-466c-8a8e-326c063384f5", "metadata": { "id": "022a649a-44f5-466c-8a8e-326c063384f5" }, "outputs": [], "source": [ "model_size = CHOOSE_MODEL.split(\" \")[-1].lstrip(\"(\").rstrip(\")\")\n", "settings, params = download_and_load_gpt2(model_size=model_size, models_dir=\"gpt2\")\n", "\n", "model = GPTModel(BASE_CONFIG)\n", "load_weights_into_gpt(model, params)\n", "model.eval();" ] }, { "cell_type": "markdown", "id": "ab8e056c-abe0-415f-b34d-df686204259e", "metadata": { "id": "ab8e056c-abe0-415f-b34d-df686204259e" }, "source": [ "- To ensure that the model was loaded correctly, let's double-check that it generates coherent text" ] }, { "cell_type": "code", "execution_count": null, "id": "d8ac25ff-74b1-4149-8dc5-4c429d464330", "metadata": { "id": "d8ac25ff-74b1-4149-8dc5-4c429d464330" }, "outputs": [], "source": [ "def generate_text_simple(model, idx, max_new_tokens, context_size):\n", " # idx is (batch, n_tokens) array of indices in the current context\n", " for _ in range(max_new_tokens):\n", "\n", " # Crop current context if it exceeds the supported context size\n", " # E.g., if LLM supports only 5 tokens, and the context size is 10\n", " # then only the last 5 tokens are used as context\n", " idx_cond = idx[:, -context_size:]\n", "\n", " # Get the predictions\n", " with torch.no_grad():\n", " logits = model(idx_cond)\n", "\n", " # Focus only on the last time step\n", " # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)\n", " logits = logits[:, -1, :]\n", "\n", " # Apply softmax to get probabilities\n", " probas = torch.softmax(logits, dim=-1) # (batch, vocab_size)\n", "\n", " # Get the idx of the vocab entry with the highest probability value\n", " idx_next = torch.argmax(probas, dim=-1, keepdim=True) # (batch, 1)\n", "\n", " # Append sampled index to the running sequence\n", " idx = torch.cat((idx, idx_next), dim=1) # (batch, n_tokens+1)\n", "\n", " return idx\n", "\n", "def text_to_token_ids(text, tokenizer):\n", " encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})\n", " encoded_tensor = torch.tensor(encoded).unsqueeze(0) # add batch dimension\n", " return encoded_tensor\n", "\n", "def token_ids_to_text(token_ids, tokenizer):\n", " flat = token_ids.squeeze(0) # remove batch dimension\n", " return tokenizer.decode(flat.tolist())\n", "\n", "\n", "text_1 = \"Every effort moves you\"\n", "\n", "token_ids = generate_text_simple(\n", " model=model,\n", " idx=text_to_token_ids(text_1, tokenizer),\n", " max_new_tokens=15,\n", " context_size=BASE_CONFIG[\"context_length\"]\n", ")\n", "\n", "print(token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "69162550-6a02-4ece-8db1-06c71d61946f", "metadata": { "id": "69162550-6a02-4ece-8db1-06c71d61946f" }, "source": [ "- Before we finetune the model as a classifier, let's see if the model can perhaps already classify spam messages via prompting" ] }, { "cell_type": "code", "execution_count": null, "id": "94224aa9-c95a-4f8a-a420-76d01e3a800c", "metadata": { "id": "94224aa9-c95a-4f8a-a420-76d01e3a800c" }, "outputs": [], "source": [ "text_2 = (\n", " \"Is the following text 'spam'? 
Answer with 'yes' or 'no':\"\n", " \" 'You are a winner you have been specially\"\n", " \" selected to receive $1000 cash or a $2000 award.'\"\n", ")\n", "\n", "token_ids = generate_text_simple(\n", " model=model,\n", " idx=text_to_token_ids(text_2, tokenizer),\n", " max_new_tokens=23,\n", " context_size=BASE_CONFIG[\"context_length\"]\n", ")\n", "\n", "print(token_ids_to_text(token_ids, tokenizer))" ] }, { "cell_type": "markdown", "id": "1ce39ed0-2c77-410d-8392-dd15d4b22016", "metadata": { "id": "1ce39ed0-2c77-410d-8392-dd15d4b22016" }, "source": [ "- As we can see, the model is not very good at following instructions\n", "- This is expected, since it has only been pretrained" ] }, { "cell_type": "markdown", "id": "4c9ae440-32f9-412f-96cf-fd52cc3e2522", "metadata": { "id": "4c9ae440-32f9-412f-96cf-fd52cc3e2522" }, "source": [ "## Adding a classification head" ] }, { "cell_type": "markdown", "id": "d6e9d66f-76b2-40fc-9ec5-3f972a8db9c0", "metadata": { "id": "d6e9d66f-76b2-40fc-9ec5-3f972a8db9c0" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "217bac05-78df-4412-bd80-612f8061c01d", "metadata": { "id": "217bac05-78df-4412-bd80-612f8061c01d" }, "source": [ "- In this section, we are modifying the pretrained LLM to make it ready for classification finetuning\n", "- Let's take a look at the model architecture first" ] }, { "cell_type": "code", "execution_count": null, "id": "b23aff91-6bd0-48da-88f6-353657e6c981", "metadata": { "id": "b23aff91-6bd0-48da-88f6-353657e6c981" }, "outputs": [], "source": [ "print(model)" ] }, { "cell_type": "markdown", "id": "3f640a76-dd00-4769-9bc8-1aed0cec330d", "metadata": { "id": "3f640a76-dd00-4769-9bc8-1aed0cec330d" }, "source": [ "- Above, we can see the architecture we implemented in gpt2.ipynb (from previous class) neatly laid out\n", "- The goal is to replace and finetune the output layer\n", "- To achieve this, we first freeze the model, meaning that we make all layers non-trainable" ] }, { "cell_type": "code", "execution_count": null, "id": "fkMWFl-0etea", "metadata": { "id": "fkMWFl-0etea" }, "outputs": [], "source": [ "for param in model.parameters():\n", " param.requires_grad = False" ] }, { "cell_type": "markdown", "id": "72155f83-87d9-476a-a978-a15aa2d44147", "metadata": { "id": "72155f83-87d9-476a-a978-a15aa2d44147" }, "source": [ "- Then, we replace the output layer (`model.out_head`), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary)\n", "- Since we finetune the model for binary classification (predicting 2 classes, \"spam\" and \"not spam\"), we can replace the output layer as shown below, which will be trainable by default\n", "- Note that we use `BASE_CONFIG[\"emb_dim\"]` (which is equal to 768 in the `\"gpt2-small (124M)\"` model) to keep the code below more general" ] }, { "cell_type": "code", "execution_count": null, "id": "7e759fa0-0f69-41be-b576-17e5f20e04cb", "metadata": { "id": "7e759fa0-0f69-41be-b576-17e5f20e04cb" }, "outputs": [], "source": [ "torch.manual_seed(123)\n", "\n", "num_classes = 2\n", "model.out_head = torch.nn.Linear(in_features=BASE_CONFIG[\"emb_dim\"], out_features=num_classes)" ] }, { "cell_type": "markdown", "id": "30be5475-ae77-4f97-8f3e-dec462b1339f", "metadata": { "id": "30be5475-ae77-4f97-8f3e-dec462b1339f" }, "source": [ "- Technically, it's sufficient to only train the output layer\n", "- Experiments in [Finetuning Large Language Models](https://magazine.sebastianraschka.com/p/finetuning-large-language-models) show that finetuning additional layers can noticeably 
improve the performance\n", "- So, we are also making the last transformer block and the final `LayerNorm` module connecting the last transformer block to the output layer trainable" ] }, { "cell_type": "markdown", "id": "0be7c1eb-c46c-4065-8525-eea1b8c66d10", "metadata": { "id": "0be7c1eb-c46c-4065-8525-eea1b8c66d10" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7", "metadata": { "id": "2aedc120-5ee3-48f6-92f2-ad9304ebcdc7" }, "outputs": [], "source": [ "for param in model.trf_blocks[-1].parameters():\n", " param.requires_grad = True\n", "\n", "for param in model.final_norm.parameters():\n", " param.requires_grad = True" ] }, { "cell_type": "markdown", "id": "f012b899-8284-4d3a-97c0-8a48eb33ba2e", "metadata": { "id": "f012b899-8284-4d3a-97c0-8a48eb33ba2e" }, "source": [ "- We can still use this model similar to before in gpt2.ipynb (from previous class)\n", "- For example, let's feed it some text input" ] }, { "cell_type": "code", "execution_count": null, "id": "f645c06a-7df6-451c-ad3f-eafb18224ebc", "metadata": { "id": "f645c06a-7df6-451c-ad3f-eafb18224ebc" }, "outputs": [], "source": [ "inputs = tokenizer.encode(\"Do you have time\")\n", "inputs = torch.tensor(inputs).unsqueeze(0)\n", "print(\"Inputs:\", inputs)\n", "print(\"Inputs dimensions:\", inputs.shape) # shape: (batch_size, num_tokens)" ] }, { "cell_type": "code", "execution_count": null, "id": "48dc84f1-85cc-4609-9cee-94ff539f00f4", "metadata": { "id": "48dc84f1-85cc-4609-9cee-94ff539f00f4" }, "outputs": [], "source": [ "with torch.no_grad():\n", " outputs = model(inputs)\n", "\n", "print(\"Outputs:\\n\", outputs)\n", "print(\"Outputs dimensions:\", outputs.shape) # shape: (batch_size, num_tokens, num_classes)" ] }, { "cell_type": "markdown", "id": "75430a01-ef9c-426a-aca0-664689c4f461", "metadata": { "id": "75430a01-ef9c-426a-aca0-664689c4f461" }, "source": [ "- For each input token, there's one output vector\n", "- Since we fed the model a text sample with 4 input tokens, the output consists of 4 2-dimensional output vectors above" ] }, { "cell_type": "markdown", "id": "7df9144f-6817-4be4-8d4b-5d4dadfe4a9b", "metadata": { "id": "7df9144f-6817-4be4-8d4b-5d4dadfe4a9b" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "e3bb8616-c791-4f5c-bac0-5302f663e46a", "metadata": { "id": "e3bb8616-c791-4f5c-bac0-5302f663e46a" }, "source": [ "- We have learnt about the attention mechanism, which connects each input token to each other input token\n", "- We then also introduced the causal attention mask that is used in GPT-like models; this causal mask lets a current token only attend to the current and previous token positions\n", "- Based on this causal attention mechanism, the 4th (last) token contains the most information among all tokens because it's the only token that includes information about all other tokens\n", "- Hence, we are particularly interested in this last token, which we will finetune for the spam classification task" ] }, { "cell_type": "code", "execution_count": null, "id": "49383a8c-41d5-4dab-98f1-238bca0c2ed7", "metadata": { "id": "49383a8c-41d5-4dab-98f1-238bca0c2ed7" }, "outputs": [], "source": [ "print(\"Last output token:\", outputs[:, -1, :])" ] }, { "cell_type": "markdown", "id": "8df08ae0-e664-4670-b7c5-8a2280d9b41b", "metadata": { "id": "8df08ae0-e664-4670-b7c5-8a2280d9b41b" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "32aa4aef-e1e9-491b-9adf-5aa973e59b8c", "metadata": { "id": "32aa4aef-e1e9-491b-9adf-5aa973e59b8c" }, 
"source": [ "## Calculating the classification loss and accuracy" ] }, { "cell_type": "markdown", "id": "669e1fd1-ace8-44b4-b438-185ed0ba8b33", "metadata": { "id": "669e1fd1-ace8-44b4-b438-185ed0ba8b33" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "7a7df4ee-0a34-4a4d-896d-affbbf81e0b3", "metadata": { "id": "7a7df4ee-0a34-4a4d-896d-affbbf81e0b3" }, "source": [ "- Before explaining the loss calculation, let's have a brief look at how the model outputs are turned into class labels" ] }, { "cell_type": "markdown", "id": "557996dd-4c6b-49c4-ab83-f60ef7e1d69e", "metadata": { "id": "557996dd-4c6b-49c4-ab83-f60ef7e1d69e" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "c77faab1-3461-4118-866a-6171f2b89aa0", "metadata": { "id": "c77faab1-3461-4118-866a-6171f2b89aa0" }, "outputs": [], "source": [ "print(\"Last output token:\", outputs[:, -1, :])" ] }, { "cell_type": "markdown", "id": "7edd71fa-628a-4d00-b81d-6d8bcb2c341d", "metadata": { "id": "7edd71fa-628a-4d00-b81d-6d8bcb2c341d" }, "source": [ "- We convert the outputs (logits) into probability scores via the `softmax` function and then obtain the index position of the largest probability value via the `argmax` function" ] }, { "cell_type": "code", "execution_count": null, "id": "b81efa92-9be1-4b9e-8790-ce1fc7b17f01", "metadata": { "id": "b81efa92-9be1-4b9e-8790-ce1fc7b17f01" }, "outputs": [], "source": [ "probas = torch.softmax(outputs[:, -1, :], dim=-1)\n", "label = torch.argmax(probas)\n", "print(\"Class label:\", label.item())" ] }, { "cell_type": "markdown", "id": "414a6f02-307e-4147-a416-14d115bf8179", "metadata": { "id": "414a6f02-307e-4147-a416-14d115bf8179" }, "source": [ "- Note that the softmax function is optional here, because the largest outputs correspond to the largest probability scores" ] }, { "cell_type": "code", "execution_count": null, "id": "f9f9ad66-4969-4501-8239-3ccdb37e71a2", "metadata": { "id": "f9f9ad66-4969-4501-8239-3ccdb37e71a2" }, "outputs": [], "source": [ "logits = outputs[:, -1, :]\n", "label = torch.argmax(logits)\n", "print(\"Class label:\", label.item())" ] }, { "cell_type": "markdown", "id": "dcb20d3a-cbba-4ab1-8584-d94e16589505", "metadata": { "id": "dcb20d3a-cbba-4ab1-8584-d94e16589505" }, "source": [ "- We can apply this concept to calculate the so-called classification accuracy, which computes the percentage of correct predictions in a given dataset\n", "- To calculate the classification accuracy, we can apply the preceding `argmax`-based prediction code to all examples in a dataset and calculate the fraction of correct predictions as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "3ecf9572-aed0-4a21-9c3b-7f9f2aec5f23", "metadata": { "id": "3ecf9572-aed0-4a21-9c3b-7f9f2aec5f23" }, "outputs": [], "source": [ "def calc_accuracy_loader(data_loader, model, device, num_batches=None):\n", " model.eval()\n", " correct_predictions, num_examples = 0, 0\n", "\n", " if num_batches is None:\n", " num_batches = len(data_loader)\n", " else:\n", " num_batches = min(num_batches, len(data_loader))\n", " for i, (input_batch, target_batch) in enumerate(data_loader):\n", " if i < num_batches:\n", " input_batch, target_batch = input_batch.to(device), target_batch.to(device)\n", "\n", " with torch.no_grad():\n", " logits = model(input_batch)[:, -1, :] # Logits of last output token\n", " predicted_labels = torch.argmax(logits, dim=-1)\n", "\n", " num_examples += predicted_labels.shape[0]\n", " correct_predictions += (predicted_labels == target_batch).sum().item()\n", 
" else:\n", " break\n", " return correct_predictions / num_examples" ] }, { "cell_type": "markdown", "id": "7165fe46-a284-410b-957f-7524877d1a1a", "metadata": { "id": "7165fe46-a284-410b-957f-7524877d1a1a" }, "source": [ "- Let's apply the function to calculate the classification accuracies for the different datasets:" ] }, { "cell_type": "code", "execution_count": null, "id": "390e5255-8427-488c-adef-e1c10ab4fb26", "metadata": { "id": "390e5255-8427-488c-adef-e1c10ab4fb26" }, "outputs": [], "source": [ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "# Note:\n", "# Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,\n", "# which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).\n", "# As of this writing, in PyTorch 2.4, the results obtained via CPU and MPS were identical.\n", "# However, in earlier versions of PyTorch, you may observe different results when using MPS.\n", "\n", "#if torch.cuda.is_available():\n", "# device = torch.device(\"cuda\")\n", "#elif torch.backends.mps.is_available():\n", "# device = torch.device(\"mps\")\n", "#else:\n", "# device = torch.device(\"cpu\")\n", "#print(f\"Running on {device} device.\")\n", "\n", "model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes\n", "\n", "torch.manual_seed(123) # For reproducibility due to the shuffling in the training data loader\n", "\n", "train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)\n", "val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)\n", "test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)\n", "\n", "print(f\"Training accuracy: {train_accuracy*100:.2f}%\")\n", "print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", "print(f\"Test accuracy: {test_accuracy*100:.2f}%\")" ] }, { "cell_type": "markdown", "id": "30345e2a-afed-4d22-9486-f4010f90a871", "metadata": { "id": "30345e2a-afed-4d22-9486-f4010f90a871" }, "source": [ "- As we can see, the prediction accuracies are not very good, since we haven't finetuned the model, yet" ] }, { "cell_type": "markdown", "id": "4f4a9d15-8fc7-48a2-8734-d92a2f265328", "metadata": { "id": "4f4a9d15-8fc7-48a2-8734-d92a2f265328" }, "source": [ "- Before we can start finetuning (/training), we first have to define the loss function we want to optimize during training\n", "- The goal is to maximize the spam classification accuracy of the model; however, classification accuracy is not a differentiable function\n", "- Hence, instead, we minimize the cross-entropy loss as a proxy for maximizing the classification accuracy\n", "- Here, we are only interested in optimizing the last token `model(input_batch)[:, -1, :]` instead of all tokens `model(input_batch)` in the `calc_loss_batch` function" ] }, { "cell_type": "code", "execution_count": null, "id": "2f1e9547-806c-41a9-8aba-3b2822baabe4", "metadata": { "id": "2f1e9547-806c-41a9-8aba-3b2822baabe4" }, "outputs": [], "source": [ "def calc_loss_batch(input_batch, target_batch, model, device):\n", " input_batch, target_batch = input_batch.to(device), target_batch.to(device)\n", " logits = model(input_batch)[:, -1, :] # Logits of last output token\n", " loss = torch.nn.functional.cross_entropy(logits, target_batch)\n", " return loss" ] }, { "cell_type": "code", "execution_count": null, "id": "b7b83e10-5720-45e7-ac5e-369417ca846b", "metadata": { "id": "b7b83e10-5720-45e7-ac5e-369417ca846b" }, "outputs": 
[], "source": [ "def calc_loss_loader(data_loader, model, device, num_batches=None):\n", " total_loss = 0.\n", " if len(data_loader) == 0:\n", " return float(\"nan\")\n", " elif num_batches is None:\n", " num_batches = len(data_loader)\n", " else:\n", " # Reduce the number of batches to match the total number of batches in the data loader\n", " # if num_batches exceeds the number of batches in the data loader\n", " num_batches = min(num_batches, len(data_loader))\n", " for i, (input_batch, target_batch) in enumerate(data_loader):\n", " if i < num_batches:\n", " loss = calc_loss_batch(input_batch, target_batch, model, device)\n", " total_loss += loss.item()\n", " else:\n", " break\n", " return total_loss / num_batches" ] }, { "cell_type": "markdown", "id": "56826ecd-6e74-40e6-b772-d3541e585067", "metadata": { "id": "56826ecd-6e74-40e6-b772-d3541e585067" }, "source": [ "- Using the `calc_closs_loader`, we compute the initial training, validation, and test set losses before we start training" ] }, { "cell_type": "code", "execution_count": null, "id": "f6f00e53-5beb-4e64-b147-f26fd481c6ff", "metadata": { "id": "f6f00e53-5beb-4e64-b147-f26fd481c6ff" }, "outputs": [], "source": [ "with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training, yet\n", " train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)\n", " val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)\n", " test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)\n", "\n", "print(f\"Training loss: {train_loss:.3f}\")\n", "print(f\"Validation loss: {val_loss:.3f}\")\n", "print(f\"Test loss: {test_loss:.3f}\")" ] }, { "cell_type": "markdown", "id": "e04b980b-e583-4f62-84a0-4edafaf99d5d", "metadata": { "id": "e04b980b-e583-4f62-84a0-4edafaf99d5d" }, "source": [ "- In the next section, we train the model to improve the loss values and consequently the classification accuracy" ] }, { "cell_type": "markdown", "id": "456ae0fd-6261-42b4-ab6a-d24289953083", "metadata": { "id": "456ae0fd-6261-42b4-ab6a-d24289953083" }, "source": [ "## Finetuning the model on supervised data" ] }, { "cell_type": "markdown", "id": "6a9b099b-0829-4f72-8a2b-4363e3497026", "metadata": { "id": "6a9b099b-0829-4f72-8a2b-4363e3497026" }, "source": [ "- In this section, we define and use the training function to improve the classification accuracy of the model\n", "- The `train_classifier_simple` function below is practically the same as the `train_model_simple` function we used for pretraining the model in the gpt2.ipynb notebook\n", "- The only two differences are that we now\n", " 1. track the number of training examples seen (`examples_seen`) instead of the number of tokens seen\n", " 2. 
calculate the accuracy after each epoch instead of printing a sample text after each epoch" ] }, { "cell_type": "markdown", "id": "979b6222-1dc2-4530-9d01-b6b04fe3de12", "metadata": { "id": "979b6222-1dc2-4530-9d01-b6b04fe3de12" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "Csbr60to50FL", "metadata": { "id": "Csbr60to50FL" }, "outputs": [], "source": [ "def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs,\n", " eval_freq, eval_iter):\n", " # Initialize lists to track losses and examples seen\n", " train_losses, val_losses, train_accs, val_accs = [], [], [], []\n", " examples_seen, global_step = 0, -1\n", "\n", " # Main training loop\n", " for epoch in range(num_epochs):\n", " model.train() # Set model to training mode\n", "\n", " for input_batch, target_batch in train_loader:\n", " optimizer.zero_grad() # Reset loss gradients from previous batch iteration\n", " loss = calc_loss_batch(input_batch, target_batch, model, device)\n", " loss.backward() # Calculate loss gradients\n", " optimizer.step() # Update model weights using loss gradients\n", " examples_seen += input_batch.shape[0] # New: track examples instead of tokens\n", " global_step += 1\n", "\n", " # Optional evaluation step\n", " if global_step % eval_freq == 0:\n", " train_loss, val_loss = evaluate_model(\n", " model, train_loader, val_loader, device, eval_iter)\n", " train_losses.append(train_loss)\n", " val_losses.append(val_loss)\n", " print(f\"Ep {epoch+1} (Step {global_step:06d}): \"\n", " f\"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}\")\n", "\n", " # Calculate accuracy after each epoch\n", " train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter)\n", " val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)\n", " print(f\"Training accuracy: {train_accuracy*100:.2f}% | \", end=\"\")\n", " print(f\"Validation accuracy: {val_accuracy*100:.2f}%\")\n", " train_accs.append(train_accuracy)\n", " val_accs.append(val_accuracy)\n", "\n", " return train_losses, val_losses, train_accs, val_accs, examples_seen" ] }, { "cell_type": "markdown", "id": "9624cb30-3e3a-45be-b006-c00475b58ae8", "metadata": { "id": "9624cb30-3e3a-45be-b006-c00475b58ae8" }, "source": [ "- The `evaluate_model` function used in the `train_classifier_simple` is the same as the one we used in the gpt2.ipynb notebook" ] }, { "cell_type": "code", "execution_count": null, "id": "bcc7bc04-6aa6-4516-a147-460e2f466eab", "metadata": { "id": "bcc7bc04-6aa6-4516-a147-460e2f466eab" }, "outputs": [], "source": [ "def evaluate_model(model, train_loader, val_loader, device, eval_iter):\n", " model.eval()\n", " with torch.no_grad():\n", " train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)\n", " val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)\n", " model.train()\n", " return train_loss, val_loss" ] }, { "cell_type": "code", "execution_count": null, "id": "X7kU3aAj7vTJ", "metadata": { "id": "X7kU3aAj7vTJ" }, "outputs": [], "source": [ "import time\n", "\n", "start_time = time.time()\n", "\n", "torch.manual_seed(123)\n", "\n", "optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)\n", "\n", "num_epochs = 5\n", "train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(\n", " model, train_loader, val_loader, optimizer, device,\n", " num_epochs=num_epochs, eval_freq=50, eval_iter=5,\n", ")\n", "\n", "end_time = time.time()\n", 
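"# Report the wall-clock training time; the exact duration depends on the hardware\n", "# (e.g., whether a CUDA GPU is available)\n",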
"execution_time_minutes = (end_time - start_time) / 60\n", "print(f\"Training completed in {execution_time_minutes:.2f} minutes.\")" ] }, { "cell_type": "markdown", "id": "1261bf90-3ce7-4591-895a-044a05538f30", "metadata": { "id": "1261bf90-3ce7-4591-895a-044a05538f30" }, "source": [ "- We use matplotlib to plot the loss function for the training and validation set" ] }, { "cell_type": "code", "execution_count": null, "id": "cURgnDqdCeka", "metadata": { "id": "cURgnDqdCeka" }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_values(epochs_seen, examples_seen, train_values, val_values, label=\"loss\"):\n", " fig, ax1 = plt.subplots(figsize=(5, 3))\n", "\n", " # Plot training and validation loss against epochs\n", " ax1.plot(epochs_seen, train_values, label=f\"Training {label}\")\n", " ax1.plot(epochs_seen, val_values, linestyle=\"-.\", label=f\"Validation {label}\")\n", " ax1.set_xlabel(\"Epochs\")\n", " ax1.set_ylabel(label.capitalize())\n", " ax1.legend()\n", "\n", " # Create a second x-axis for examples seen\n", " ax2 = ax1.twiny() # Create a second x-axis that shares the same y-axis\n", " ax2.plot(examples_seen, train_values, alpha=0) # Invisible plot for aligning ticks\n", " ax2.set_xlabel(\"Examples seen\")\n", "\n", " fig.tight_layout() # Adjust layout to make room\n", " plt.savefig(f\"{label}-plot.pdf\")\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "OIqRt466DiGk", "metadata": { "id": "OIqRt466DiGk" }, "outputs": [], "source": [ "epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))\n", "examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))\n", "\n", "plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)" ] }, { "cell_type": "markdown", "id": "dbd28174-1836-44ba-b6c0-7e0be774fadc", "metadata": { "id": "dbd28174-1836-44ba-b6c0-7e0be774fadc" }, "source": [ "- Above, based on the downward slope, we see that the model learns well\n", "- Furthermore, the fact that the training and validation loss are very close indicates that the model does not tend to overfit the training data\n", "- Similarly, we can plot the accuracy below" ] }, { "cell_type": "code", "execution_count": null, "id": "yz8BIsaF0TUo", "metadata": { "id": "yz8BIsaF0TUo" }, "outputs": [], "source": [ "epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))\n", "examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))\n", "\n", "plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label=\"accuracy\")" ] }, { "cell_type": "markdown", "id": "90aba699-21bc-42de-a69c-99f370bb0363", "metadata": { "id": "90aba699-21bc-42de-a69c-99f370bb0363" }, "source": [ "- Based on the accuracy plot above, we can see that the model achieves a relatively high training and validation accuracy after epochs 4 and 5\n", "- However, we have to keep in mind that we specified `eval_iter=5` in the training function earlier, which means that we only estimated the training and validation set performances\n", "- We can compute the training, validation, and test set performances over the complete dataset as follows below" ] }, { "cell_type": "code", "execution_count": null, "id": "UHWaJFrjY0zW", "metadata": { "id": "UHWaJFrjY0zW" }, "outputs": [], "source": [ "train_accuracy = calc_accuracy_loader(train_loader, model, device)\n", "val_accuracy = calc_accuracy_loader(val_loader, model, device)\n", "test_accuracy = calc_accuracy_loader(test_loader, model, device)\n", "\n", "print(f\"Training accuracy: 
{ "cell_type": "markdown", "id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0", "metadata": { "id": "a74d9ad7-3ec1-450e-8c9f-4fc46d3d5bb0" }, "source": [ "## Using the LLM as a spam classifier" ] }, { "cell_type": "markdown", "id": "72ebcfa2-479e-408b-9cf0-7421f6144855", "metadata": { "id": "72ebcfa2-479e-408b-9cf0-7421f6144855" }, "source": [ "" ] }, { "cell_type": "markdown", "id": "fd5408e6-83e4-4e5a-8503-c2fba6073f31", "metadata": { "id": "fd5408e6-83e4-4e5a-8503-c2fba6073f31" }, "source": [ "- Finally, let's see the finetuned GPT model in action\n", "- The `classify_review` function below implements data preprocessing steps similar to those in the `SpamDataset` we implemented earlier\n", "- Then, the function converts the model's predicted integer class label into the corresponding class name and returns it" ] }, { "cell_type": "code", "execution_count": null, "id": "aHdn6xvL-IW5", "metadata": { "id": "aHdn6xvL-IW5" }, "outputs": [], "source": [ "def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):\n", " model.eval()\n", "\n", " # Prepare inputs to the model\n", " input_ids = tokenizer.encode(text)\n", " supported_context_length = model.pos_emb.weight.shape[0]\n", " # Note: In the book, this was originally written as pos_emb.weight.shape[1] by mistake\n", " # It didn't break the code but would have caused unnecessary truncation (to 768 instead of 1024)\n", "\n", " # Truncate sequences if they are too long\n", " input_ids = input_ids[:min(max_length, supported_context_length)]\n", "\n", " # Pad sequences to the longest sequence\n", " input_ids += [pad_token_id] * (max_length - len(input_ids))\n", " input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0) # add batch dimension\n", "\n", " # Model inference\n", " with torch.no_grad():\n", " logits = model(input_tensor)[:, -1, :] # Logits of the last output token\n", " predicted_label = torch.argmax(logits, dim=-1).item()\n", "\n", " # Return the classified result\n", " return \"spam\" if predicted_label == 1 else \"not spam\"" ] }, { "cell_type": "markdown", "id": "f29682d8-a899-4d9b-b973-f8d5ec68172c", "metadata": { "id": "f29682d8-a899-4d9b-b973-f8d5ec68172c" }, "source": [ "- Let's try it out on a few examples below" ] }, { "cell_type": "code", "execution_count": null, "id": "apU_pf51AWSV", "metadata": { "id": "apU_pf51AWSV" }, "outputs": [], "source": [ "text_1 = (\n", " \"You are a winner you have been specially\"\n", " \" selected to receive $1000 cash or a $2000 award.\"\n", ")\n", "\n", "print(classify_review(\n", " text_1, model, tokenizer, device, max_length=train_dataset.max_length\n", "))" ] }, { "cell_type": "code", "execution_count": null, "id": "1g5VTOo_Ajs5",
"metadata": { "id": "1g5VTOo_Ajs5" }, "outputs": [], "source": [ "text_2 = (\n", " \"Hey, just wanted to check if we're still on\"\n", " \" for dinner tonight? Let me know!\"\n", ")\n", "\n", "print(classify_review(\n", " text_2, model, tokenizer, device, max_length=train_dataset.max_length\n", "))" ] }, { "cell_type": "markdown", "id": "bf736e39-0d47-40c1-8d18-1f716cf7a81e", "metadata": { "id": "bf736e39-0d47-40c1-8d18-1f716cf7a81e" }, "source": [ "- Finally, let's save the model in case we want to reuse the model later without having to train it again" ] }, { "cell_type": "code", "execution_count": null, "id": "mYnX-gI1CfQY", "metadata": { "id": "mYnX-gI1CfQY" }, "outputs": [], "source": [ "torch.save(model.state_dict(), \"review_classifier.pth\")" ] }, { "cell_type": "markdown", "id": "ba78cf7c-6b80-4f71-a50e-3ccc73839af6", "metadata": { "id": "ba78cf7c-6b80-4f71-a50e-3ccc73839af6" }, "source": [ "- Then, in a new session, we could load the model as follows" ] }, { "cell_type": "code", "execution_count": null, "id": "cc4e68a5-d492-493b-87ef-45c475f353f5", "metadata": { "id": "cc4e68a5-d492-493b-87ef-45c475f353f5" }, "outputs": [], "source": [ "model_state_dict = torch.load(\"review_classifier.pth\", map_location=device, weights_only=True)\n", "model.load_state_dict(model_state_dict)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }