{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Traininig of the High Level Feature classifier with Pytorch Lightning\n", "\n", "**PyTorch Lightning, HLF classifier** This notebooks trains a dense neural network for the particle classifier using High Level Features. It uses Pytorch Lightning on a single node. Pandas is used to read the data and pass it to Lightning.\n", "\n", "Credits: this notebook is taken with permission from the work: \n", "- [Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics Comput Softw Big Sci 4, 8 (2020)](https://rdcu.be/b4Wk9) \n", "- Code and data at:https://github.com/cerndb/SparkDLTrigger\n", "- The model is a classifier implemented as a DNN\n", " - Model input: 14 \"high level features\", described in [ Topology classification with deep learning to improve real-time event selection at the LHC](https://link.springer.com/epdf/10.1007/s41781-019-0028-1?author_access_token=eTrqfrCuFIP2vF4nDLnFfPe4RwlQNchNByi7wbcMAY7NPT1w8XxcX1ECT83E92HWx9dJzh9T9_y5Vfi9oc80ZXe7hp7PAj21GjdEF2hlNWXYAkFiNn--k5gFtNRj6avm0UukUt9M9hAH_j4UR7eR-g%3D%3D)\n", " - Model output: 3 classes, \"W + jet\", \"QCD\", \"$t\\bar{t}$\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load train and test datasets via Pandas" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Download the datasets from \n", "# https://github.com/cerndb/SparkDLTrigger/tree/master/Data\n", "#\n", "# For CERN users, data is already available on EOS\n", "PATH = \"/eos/project/s/sparkdltrigger/public/\"\n", "\n", "import pandas as pd\n", "\n", "testPDF = pd.read_parquet(path= PATH + 'testUndersampled_HLF_features.parquet', \n", " columns=['HLF_input', 'encoded_label'])\n", "\n", "trainPDF = pd.read_parquet(path= PATH + 'trainUndersampled_HLF_features.parquet', \n", " columns=['HLF_input', 'encoded_label'])" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are HLF_input 856090\n", "encoded_label 856090\n", "dtype: int64 events in the test dataset\n", "There are HLF_input 3426083\n", "encoded_label 3426083\n", "dtype: int64 events in the train dataset\n" ] } ], "source": [ "# Check the number of events in the train and test datasets\n", "\n", "num_test = testPDF.count()\n", "num_train = trainPDF.count()\n", "\n", "print('There are {} events in the test dataset'.format(num_test))\n", "print('There are {} events in the train dataset'.format(num_train))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HLF_inputencoded_label
0[0.015150733133517018, 0.003511028294205839, 0...[1.0, 0.0, 0.0]
1[0.0, 0.003881822832783805, 0.7166341448458555...[1.0, 0.0, 0.0]
2[0.009639073600865505, 0.0010022659022912096, ...[1.0, 0.0, 0.0]
3[0.016354407625436572, 0.002108937905084598, 0...[1.0, 0.0, 0.0]
4[0.01925979125354152, 0.004603697276827594, 0....[1.0, 0.0, 0.0]
.........
856085[0.020383967386165446, 0.0022348975484913444, ...[0.0, 1.0, 0.0]
856086[0.02475209699743233, 0.00867502196073073, 0.3...[0.0, 1.0, 0.0]
856087[0.03498179428310887, 0.02506331737284528, 0.9...[0.0, 1.0, 0.0]
856088[0.03735147362869153, 0.003645269183639405, 0....[0.0, 1.0, 0.0]
856089[0.04907273976147946, 0.003058462073646085, 0....[0.0, 1.0, 0.0]
\n", "

856090 rows × 2 columns

\n", "
" ], "text/plain": [ " HLF_input encoded_label\n", "0 [0.015150733133517018, 0.003511028294205839, 0... [1.0, 0.0, 0.0]\n", "1 [0.0, 0.003881822832783805, 0.7166341448458555... [1.0, 0.0, 0.0]\n", "2 [0.009639073600865505, 0.0010022659022912096, ... [1.0, 0.0, 0.0]\n", "3 [0.016354407625436572, 0.002108937905084598, 0... [1.0, 0.0, 0.0]\n", "4 [0.01925979125354152, 0.004603697276827594, 0.... [1.0, 0.0, 0.0]\n", "... ... ...\n", "856085 [0.020383967386165446, 0.0022348975484913444, ... [0.0, 1.0, 0.0]\n", "856086 [0.02475209699743233, 0.00867502196073073, 0.3... [0.0, 1.0, 0.0]\n", "856087 [0.03498179428310887, 0.02506331737284528, 0.9... [0.0, 1.0, 0.0]\n", "856088 [0.03735147362869153, 0.003645269183639405, 0.... [0.0, 1.0, 0.0]\n", "856089 [0.04907273976147946, 0.003058462073646085, 0.... [0.0, 1.0, 0.0]\n", "\n", "[856090 rows x 2 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the schema and a data sample of the test dataset\n", "testPDF\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convert training and test datasets from Pandas DataFrames to Numpy arrays\n", "\n", "Now we will collect and convert the Pandas DataFrame into numpy arrays in order to be able to feed them to TensorFlow/Keras.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "X = np.stack(trainPDF[\"HLF_input\"]).astype(np.float32)\n", "y = np.stack(trainPDF[\"encoded_label\"]).astype(np.float32)\n", "\n", "X_test = np.stack(testPDF[\"HLF_input\"]).astype(np.float32)\n", "y_test = np.stack(testPDF[\"encoded_label\"]).astype(np.float32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create PyTorch Lightning model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.0.2'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "from torch.utils.data import TensorDataset, DataLoader\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "from torch.optim.lr_scheduler import StepLR\n", "import pytorch_lightning as pl\n", "\n", "torch.__version__\n", "pl.__version__" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "torch.cuda.is_available()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class LightningNet(pl.LightningModule):\n", " def __init__(self, nh_1, nh_2, nh_3):\n", " super(LightningNet, self).__init__()\n", " self.train_loss_history = []\n", " self.val_loss_history = []\n", " self.train_acc_history = []\n", " self.val_acc_history = []\n", " self.fc1 = nn.Linear(14, nh_1)\n", " self.fc2 = nn.Linear(nh_1, nh_2)\n", " self.fc3 = nn.Linear(nh_2, nh_3)\n", " self.fc4 = nn.Linear(nh_3, 3)\n", " \n", " def forward(self, x):\n", " x = F.relu(self.fc1(x))\n", " x = F.relu(self.fc2(x))\n", " x = F.relu(self.fc3(x))\n", " output = F.softmax(self.fc4(x), dim=1)\n", " return output\n", "\n", " def training_step(self, batch, batch_idx):\n", " x, y = batch\n", " y_hat = self(x)\n", " loss = F.binary_cross_entropy_with_logits(y_hat, y)\n", " \n", " # Compute accuracy\n", " y_pred = torch.sigmoid(y_hat) > 0.5\n", " accuracy = (y_pred == y).sum().item() / (y.numel())\n", " \n", " self.log('train_loss', loss)\n", " self.log('train_acc', accuracy, prog_bar=True)\n", " \n", " return loss\n", " \n", " def validation_step(self, batch, batch_idx):\n", " x, y = batch\n", " y_hat = self(x)\n", " loss = F.binary_cross_entropy_with_logits(y_hat, y)\n", " \n", " # Compute accuracy\n", " y_pred = torch.sigmoid(y_hat) > 0.5\n", " accuracy = (y_pred == y).sum().item() / (y.numel())\n", " \n", " self.log('val_loss', loss)\n", " self.log('val_acc', accuracy, prog_bar=True)\n", " \n", " return loss\n", "\n", " def configure_optimizers(self):\n", " optimizer = torch.optim.Adam(self.parameters(), lr=0.001)\n", " return {\n", " 'optimizer': optimizer,\n", " 'lr_scheduler': {\n", " 'scheduler': StepLR(optimizer, step_size=1, gamma=0.7),\n", " 'interval': 'epoch'\n", " }\n", " }" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "class TensorDataModule(pl.LightningDataModule):\n", " def __init__(self, train_X, train_y, val_X=None, val_y=None, batch_size=32, num_workers=0):\n", " super().__init__()\n", " self.train_X = train_X\n", " self.train_y = train_y\n", " self.val_X = val_X\n", " self.val_y = val_y\n", " self.batch_size = batch_size\n", " self.num_workers = num_workers\n", "\n", " def setup(self, stage=None):\n", " self.train_dataset = TensorDataset(torch.tensor(self.train_X, dtype=torch.float32), torch.tensor(self.train_y, dtype=torch.float32))\n", "\n", " if self.val_X is not None and self.val_y is not None:\n", " self.val_dataset = TensorDataset(torch.tensor(self.val_X, dtype=torch.float32), torch.tensor(self.val_y, dtype=torch.float32))\n", " \n", " def train_dataloader(self):\n", " train_loader = DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True, num_workers=self.num_workers)\n", " return train_loader\n", "\n", " def val_dataloader(self):\n", " if hasattr(self, 'val_dataset'):\n", " val_loader = DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False, num_workers=self.num_workers)\n", " return val_loader\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "class HistoryCallback(pl.Callback):\n", " def on_train_epoch_end(self, trainer, pl_module):\n", " # Extract training/validation loss and accuracy from the trainer object\n", " train_loss = trainer.callback_metrics['train_loss']\n", " val_loss = trainer.callback_metrics['val_loss']\n", " train_acc = trainer.callback_metrics['train_acc']\n", " val_acc = trainer.callback_metrics['val_acc']\n", " \n", " # Store the history\n", " pl_module.train_loss_history.append(train_loss.item())\n", " pl_module.val_loss_history.append(val_loss.item())\n", " pl_module.train_acc_history.append(train_acc.item())\n", " pl_module.val_acc_history.append(val_acc.item())\n", " \n", " def on_epoch_start(self, trainer, pl_module):\n", " super().on_epoch_start(trainer, pl_module)\n", " \n", " # Update the progress bar with metrics from previous epochs\n", " self.main_progress_bar.set_postfix(trainer.logger_connector.epoch_log_metrics)\n", " self.main_progress_bar.refresh()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "\n", " | Name | Type | Params\n", "--------------------------------\n", "0 | fc1 | Linear | 750 \n", "1 | fc2 | Linear | 1.0 K \n", "2 | fc3 | Linear | 210 \n", "3 | fc4 | Linear | 33 \n", "--------------------------------\n", "2.0 K Trainable params\n", "0 Non-trainable params\n", "2.0 K Total params\n", "0.008 Total estimated model params size (MB)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Sanity Checking: 0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dfe8587b37124d8bbd6dab51fbabd8d7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Training: 0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Setup and run the training\n", "\n", "device = torch.device(\"cuda\")\n", "#device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "num_epochs = 5\n", "train_batch_size = 128\n", "num_workers = 4\n", "torch.manual_seed(1)\n", "\n", "data_module = TensorDataModule(X, y, X_test, y_test, train_batch_size, num_workers)\n", "\n", "model = LightningNet(50, 20, 10)\n", "\n", "history_callback = HistoryCallback()\n", "\n", "trainer = pl.Trainer(max_epochs=num_epochs, callbacks=[history_callback])\n", "\n", "%time trainer.fit(model, data_module) \n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c45aa97a633e4245b320ea4a9d13e0b7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Validation: 0it [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n",
       "┃      Validate metric             DataLoader 0        ┃\n",
       "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n",
       "│          val_acc               0.828701376914978     │\n",
       "│         val_loss               0.596511721611023     │\n",
       "└───────────────────────────┴───────────────────────────┘\n",
       "
\n" ], "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1m Validate metric \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m DataLoader 0 \u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n", "│\u001b[36m \u001b[0m\u001b[36m val_acc \u001b[0m\u001b[36m \u001b[0m│\u001b[35m \u001b[0m\u001b[35m 0.828701376914978 \u001b[0m\u001b[35m \u001b[0m│\n", "│\u001b[36m \u001b[0m\u001b[36m val_loss \u001b[0m\u001b[36m \u001b[0m│\u001b[35m \u001b[0m\u001b[35m 0.596511721611023 \u001b[0m\u001b[35m \u001b[0m│\n", "└───────────────────────────┴───────────────────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[{'val_loss': 0.596511721611023, 'val_acc': 0.828701376914978}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer.validate(model, data_module)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance metrics" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/luca/ipykernel_12581/1690488990.py:3: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-