{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Training of the High Level Feature classifier with PyTorch Lightning\n", "\n", "**PyTorch Lightning, HLF classifier** This notebook trains a dense neural network for the particle classifier using High Level Features. It uses PyTorch Lightning on a single node. Pandas is used to read the data and pass it to Lightning.\n", "\n", "Credits: this notebook is taken with permission from the work:\n", "- [Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)](https://rdcu.be/b4Wk9)\n", "- Code and data at: https://github.com/cerndb/SparkDLTrigger\n", "- The model is a classifier implemented as a DNN\n", " - Model input: 14 \"high level features\", described in [Topology classification with deep learning to improve real-time event selection at the LHC](https://link.springer.com/epdf/10.1007/s41781-019-0028-1?author_access_token=eTrqfrCuFIP2vF4nDLnFfPe4RwlQNchNByi7wbcMAY7NPT1w8XxcX1ECT83E92HWx9dJzh9T9_y5Vfi9oc80ZXe7hp7PAj21GjdEF2hlNWXYAkFiNn--k5gFtNRj6avm0UukUt9M9hAH_j4UR7eR-g%3D%3D)\n", " - Model output: 3 classes, \"W + jet\", \"QCD\", \"$t\\bar{t}$\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load train and test datasets via Pandas" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Download the datasets from\n", "# https://github.com/cerndb/SparkDLTrigger/tree/master/Data\n", "#\n", "# For CERN users, data is already available on EOS\n", "PATH = \"/eos/project/s/sparkdltrigger/public/\"\n", "\n", "import pandas as pd\n", "\n", "testPDF = pd.read_parquet(path=PATH + 'testUndersampled_HLF_features.parquet',\n", " columns=['HLF_input', 'encoded_label'])\n", "\n", "trainPDF = pd.read_parquet(path=PATH + 'trainUndersampled_HLF_features.parquet',\n", " columns=['HLF_input', 'encoded_label'])" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true },
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 856090 events in the test dataset\n", "There are 3426083 events in the train dataset\n" ] } ], "source": [ "# Check the number of events in the train and test datasets\n", "# Note: len(df) returns a plain row count, while df.count() returns\n", "# a per-column Series, which clutters the printed message\n", "\n", "num_test = len(testPDF)\n", "num_train = len(trainPDF)\n", "\n", "print('There are {} events in the test dataset'.format(num_test))\n", "print('There are {} events in the train dataset'.format(num_train))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
" | HLF_input | encoded_label |\n",
"---|---|---|\n",
"0 | [0.015150733133517018, 0.003511028294205839, 0... | [1.0, 0.0, 0.0] |\n",
"1 | [0.0, 0.003881822832783805, 0.7166341448458555... | [1.0, 0.0, 0.0] |\n",
"2 | [0.009639073600865505, 0.0010022659022912096, ... | [1.0, 0.0, 0.0] |\n",
"3 | [0.016354407625436572, 0.002108937905084598, 0... | [1.0, 0.0, 0.0] |\n",
"4 | [0.01925979125354152, 0.004603697276827594, 0.... | [1.0, 0.0, 0.0] |\n",
"... | ... | ... |\n",
"856085 | [0.020383967386165446, 0.0022348975484913444, ... | [0.0, 1.0, 0.0] |\n",
"856086 | [0.02475209699743233, 0.00867502196073073, 0.3... | [0.0, 1.0, 0.0] |\n",
"856087 | [0.03498179428310887, 0.02506331737284528, 0.9... | [0.0, 1.0, 0.0] |\n",
"856088 | [0.03735147362869153, 0.003645269183639405, 0.... | [0.0, 1.0, 0.0] |\n",
"856089 | [0.04907273976147946, 0.003058462073646085, 0.... | [0.0, 1.0, 0.0] |\n",
"856090 rows × 2 columns
\n", "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n", "┃ Validate metric ┃ DataLoader 0 ┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n", "│ val_acc │ 0.828701376914978 │\n", "│ val_loss │ 0.596511721611023 │\n", "└───────────────────────────┴───────────────────────────┘\n", "\n" ], "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1m Validate metric \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m DataLoader 0 \u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n", "│\u001b[36m \u001b[0m\u001b[36m val_acc \u001b[0m\u001b[36m \u001b[0m│\u001b[35m \u001b[0m\u001b[35m 0.828701376914978 \u001b[0m\u001b[35m \u001b[0m│\n", "│\u001b[36m \u001b[0m\u001b[36m val_loss \u001b[0m\u001b[36m \u001b[0m│\u001b[35m \u001b[0m\u001b[35m 0.596511721611023 \u001b[0m\u001b[35m \u001b[0m│\n", "└───────────────────────────┴───────────────────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[{'val_loss': 0.596511721611023, 'val_acc': 0.828701376914978}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainer.validate(model, data_module)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance metrics" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/luca/ipykernel_12581/1690488990.py:3: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-