TensorFlow_Keras_HLF_with_Pandas_Parquet.ipynb

Training of the High Level Feature classifier with TensorFlow/Keras

TensorFlow/Keras, HLF classifier: this notebook trains a dense neural network for a particle classifier using High Level Features (HLF). It uses TensorFlow/Keras on a single node. Pandas reads the data and passes it to TensorFlow as NumPy arrays.

Credits: this notebook is taken, with permission, from the following work:

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)
Code and data at: https://github.com/cerndb/SparkDLTrigger

The model is a classifier implemented as a DNN:
- Model input: 14 "high level features", described in "Topology classification with deep learning to improve real-time event selection at the LHC"
- Model output: 3 classes, "W + jet", "QCD", "$t\bar{t}$"

Load train and test datasets via Pandas

In [1]:

# Download the datasets from 
# ** https://github.com/cerndb/SparkDLTrigger/tree/master/Data **
#
# For CERN users, data is already available on EOS
PATH = "/eos/project/s/sparkdltrigger/public/"


import pandas as pd

testPDF = pd.read_parquet(path= PATH + 'testUndersampled_HLF_features.parquet', 
                          columns=['HLF_input', 'encoded_label'])

trainPDF = pd.read_parquet(path= PATH + 'trainUndersampled_HLF_features.parquet', 
                           columns=['HLF_input', 'encoded_label'])

In [2]:

# Check the number of events in the train and test datasets.
# len() returns the row count; DataFrame.count() would instead return
# a per-column Series of non-null counts.

num_test = len(testPDF)
num_train = len(trainPDF)

print('There are {} events in the test dataset'.format(num_test))
print('There are {} events in the train dataset'.format(num_train))

There are 856090 events in the test dataset
There are 3426083 events in the train dataset

In [3]:

# Show the schema and a data sample of the test dataset
testPDF

Out[3]:

	HLF_input	encoded_label
0	[0.015150733133517018, 0.003511028294205839, 0...	[1.0, 0.0, 0.0]
1	[0.0, 0.003881822832783805, 0.7166341448458555...	[1.0, 0.0, 0.0]
2	[0.009639073600865505, 0.0010022659022912096, ...	[1.0, 0.0, 0.0]
3	[0.016354407625436572, 0.002108937905084598, 0...	[1.0, 0.0, 0.0]
4	[0.01925979125354152, 0.004603697276827594, 0....	[1.0, 0.0, 0.0]
...	...	...
856085	[0.020383967386165446, 0.0022348975484913444, ...	[0.0, 1.0, 0.0]
856086	[0.02475209699743233, 0.00867502196073073, 0.3...	[0.0, 1.0, 0.0]
856087	[0.03498179428310887, 0.02506331737284528, 0.9...	[0.0, 1.0, 0.0]
856088	[0.03735147362869153, 0.003645269183639405, 0....	[0.0, 1.0, 0.0]
856089	[0.04907273976147946, 0.003058462073646085, 0....	[0.0, 1.0, 0.0]

856090 rows × 2 columns

Convert training and test datasets from Pandas DataFrames to NumPy arrays

Next, we convert the Pandas DataFrames into NumPy arrays so that they can be fed to TensorFlow/Keras. Each column holds one sequence of values per event; np.stack joins these into a single 2-D array with one row per event.

In [4]:

import numpy as np

X = np.stack(trainPDF["HLF_input"])
y = np.stack(trainPDF["encoded_label"])

X_test = np.stack(testPDF["HLF_input"])
y_test = np.stack(testPDF["encoded_label"])
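
As a minimal sketch of what np.stack does in the cell above, the toy example below uses plain Python lists of per-event arrays as a stand-in for the DataFrame columns. The names hlf_column, label_column, X_toy and y_toy, and all the values, are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for a DataFrame column: one array per event
hlf_column = [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]
label_column = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]

# np.stack joins the per-event arrays along a new first axis
X_toy = np.stack(hlf_column)
y_toy = np.stack(label_column)

print(X_toy.shape)  # (2, 3): events x features
print(y_toy.shape)  # (2, 3): events x classes
```

With the real data, X should have one row per training event and 14 columns, and y one row per event and 3 columns.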

Create the Keras model

In [ ]:

import tensorflow as tf
tf.__version__

In [ ]:

# Check that we have a GPU available
tf.config.list_physical_devices('GPU')

In [ ]:

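from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model(nh_1, nh_2, nh_3):
    """Build and compile a dense classifier with three hidden layers."""
    ## Create model: three fully connected hidden layers on the 14 HLF inputs,
    ## followed by a 3-way softmax output (one unit per class)
    model = Sequential()
    model.add(Dense(nh_1, input_shape=(14,), activation='relu'))
    model.add(Dense(nh_2, activation='relu'))
    model.add(Dense(nh_3, activation='relu'))
    model.add(Dense(3, activation='softmax'))

    ## Compile model: Adam with default settings and categorical cross-entropy,
    ## which matches the one-hot encoded labels
    optimizer = Adam()
    loss = 'categorical_crossentropy'
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])

    return model

# Hidden layer sizes: 50, 20 and 10 units
keras_model = create_model(50, 20, 10)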
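
To get a feel for the size of this network, the sketch below counts the trainable parameters of a stack of dense layers with the widths used above (14 inputs, hidden layers of 50, 20 and 10 units, 3 outputs). The helper dense_param_count is a hypothetical name introduced here for illustration, not part of Keras.

```python
# Layer widths: 14 inputs, hidden layers of 50, 20 and 10 units, 3 outputs
layer_sizes = [14, 50, 20, 10, 3]

def dense_param_count(sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

n_params = dense_param_count(layer_sizes)
print(n_params)  # 2013
```

Calling keras_model.summary() should report the same total, which makes it a quick sanity check on the architecture.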

Train the model

In [8]:

batch_size = 128
n_epochs = 5

%time history = keras_model.fit(X, y, batch_size=batch_size, epochs=n_epochs, \
                                validation_data=(X_test, y_test))

Epoch 1/5

2023-05-15 21:39:35.542516: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x7f1444016f10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-05-15 21:39:35.542679: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-05-15 21:39:35.583591: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-05-15 21:39:35.919975: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2023-05-15 21:39:36.553056: I ./tensorflow/compiler/jit/device_compiler.h:180] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

26767/26767 [==============================] - 134s 5ms/step - loss: 0.2822 - accuracy: 0.8965 - val_loss: 0.2511 - val_accuracy: 0.9080
Epoch 2/5
26767/26767 [==============================] - 129s 5ms/step - loss: 0.2451 - accuracy: 0.9093 - val_loss: 0.2415 - val_accuracy: 0.9109
Epoch 3/5
26767/26767 [==============================] - 130s 5ms/step - loss: 0.2374 - accuracy: 0.9120 - val_loss: 0.2342 - val_accuracy: 0.9138
Epoch 4/5
26767/26767 [==============================] - 128s 5ms/step - loss: 0.2337 - accuracy: 0.9134 - val_loss: 0.2348 - val_accuracy: 0.9137
Epoch 5/5
26767/26767 [==============================] - 128s 5ms/step - loss: 0.2316 - accuracy: 0.9142 - val_loss: 0.2280 - val_accuracy: 0.9156
CPU times: user 11min 20s, sys: 2min 50s, total: 14min 10s
Wall time: 10min 49s

Performance metrics

In [9]:

%matplotlib notebook
import matplotlib.pyplot as plt 
plt.style.use('seaborn-darkgrid')
# Graph with loss vs. epoch

plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.title("HLF classifier loss")
plt.show()


In [10]:

# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.title("HLF classifier accuracy")
plt.show()

Confusion Matrix

In [11]:

# Predicted class probabilities for each test event
y_pred = keras_model.predict(X_test)
y_true = y_test

26753/26753 [==============================] - 46s 2ms/step
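
The step counts shown in the progress bars (26767 per training epoch, 26753 for prediction) follow from the dataset sizes and the batch sizes. The short check below reproduces them, assuming the Keras default batch size of 32 for predict(), since no batch_size was passed there.

```python
import math

n_train, n_test = 3426083, 856090   # event counts reported earlier in the notebook
fit_batch_size = 128                # batch_size passed to fit()
predict_batch_size = 32             # Keras default when predict() gets no batch_size

# The last batch may be partial, hence the ceiling
steps_per_epoch = math.ceil(n_train / fit_batch_size)
predict_steps = math.ceil(n_test / predict_batch_size)

print(steps_per_epoch)  # 26767
print(predict_steps)    # 26753
```

Passing a larger batch_size to predict() would reduce the number of steps and would likely shorten inference time on a GPU.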

In [12]:

from sklearn.metrics import accuracy_score

print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the HLF classifier: 0.9156
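
The accuracy above compares the most probable predicted class with the true class of each event. A minimal sketch of the same computation with NumPy only, on made-up labels and probabilities for 4 events:

```python
import numpy as np

# Made-up one-hot labels and predicted class probabilities for 4 events
y_true_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_pred_toy = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.4, 0.3],
                       [0.2, 0.5, 0.3]])

true_class = np.argmax(y_true_toy, axis=1)   # [0, 1, 2, 1]
pred_class = np.argmax(y_pred_toy, axis=1)   # [0, 1, 1, 1]

# Fraction of events whose predicted class matches the true class
accuracy = np.mean(true_class == pred_class)
print(accuracy)  # 0.75
```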

In [13]:

import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)

## Normalize CM: divide each row (true class) by its total so that rows sum to 1
cm = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
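
Row normalization of a confusion matrix depends on how the row sums broadcast. The sketch below, on a made-up 2-class matrix, divides each row by its own total using keepdims=True; the names cm_toy and cm_norm are illustrative.

```python
import numpy as np

# Made-up 2-class confusion matrix: rows are true classes, columns are predictions
cm_toy = np.array([[8, 2],
                   [1, 9]])

# keepdims=True keeps the row sums as a column vector of shape (2, 1),
# so each row is divided by its own total
row_sums = cm_toy.sum(axis=1, keepdims=True)
cm_norm = cm_toy / row_sums

print(cm_norm)
```

Without keepdims=True the row sums have shape (2,) and broadcast across columns, so column j would be divided by the total of row j. The two results agree only when all classes have the same number of events.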

ROC and AUC

In [14]:

from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

In [15]:

# Dictionary containing the ROC AUC for each of the three classes
roc_auc

Out[15]:

{0: 0.9872210857423231, 1: 0.9854308605088364, 2: 0.9813867488029954}
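
One way to read these numbers: the AUC equals the probability that a randomly chosen event of the class in question receives a higher score than a randomly chosen event of the other classes. The sketch below computes it that way on made-up scores. The function pairwise_auc is a hypothetical helper written for illustration; it is quadratic in the number of events and is not how scikit-learn computes the value.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = signal, 0 = background) and classifier scores
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```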

In [16]:

%matplotlib notebook

# Plot roc curve 
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')

plt.figure()
plt.plot(fpr[0], tpr[0], lw=2, \
         label='HLF classifier (AUC) = %0.4f' % roc_auc[0])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()
