TensorFlow_Keras_HLF_with_TFRecord.ipynb
Training the High Level Feature classifier with TensorFlow/Keras using data in TFRecord format
TensorFlow/Keras and TFRecord, HLF classifier. This notebook trains a dense neural network as a particle classifier using High Level Features. It runs TensorFlow/Keras on a single node and reads the data with TensorFlow from files in TFRecord format.
Credits: this notebook is taken with permission from the work:
- Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)
- Code and data at: https://github.com/cerndb/SparkDLTrigger
- The model is a classifier implemented as a DNN
- Model input: 14 "high level features", described in Topology classification with deep learning to improve real-time event selection at the LHC
- Model output: 3 classes, "W + jet", "QCD", "$t\bar{t}$"
Create the Keras model
In [ ]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
In [2]:
tf.version.VERSION
Out[2]:
In [ ]:
# Check that we have a GPU available
tf.config.list_physical_devices('GPU')
In [ ]:
def create_model(nh_1, nh_2, nh_3):
    ## Create model: three hidden ReLU layers and a 3-class softmax output
    model = Sequential()
    model.add(Dense(nh_1, input_shape=(14,), activation='relu'))
    model.add(Dense(nh_2, activation='relu'))
    model.add(Dense(nh_3, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    ## Compile model
    optimizer = Adam()
    loss = 'categorical_crossentropy'
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    return model
keras_model = create_model(50, 20, 10)
In [5]:
keras_model.summary()
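As a sanity check on the summary, the parameter count can be derived by hand: each Dense layer has inputs × units weights plus units biases, giving 14×50+50 = 750, 50×20+20 = 1020, 20×10+10 = 210 and 10×3+3 = 33, i.e. 2013 trainable parameters in total.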
Load data and train the Keras model
In [6]:
# Download the datasets from
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data
#
# For CERN users, data is already available on EOS
PATH = "/eos/project/s/sparkdltrigger/public/"
# test dataset (file order kept deterministic with shuffle=False)
folder = PATH + "testUndersampled_HLF_features.tfrecord"
files_test_dataset = tf.data.Dataset.list_files(folder + "/part-r*", shuffle=False)
# training dataset
folder = PATH + "trainUndersampled_HLF_features.tfrecord"
files_train_dataset = tf.data.Dataset.list_files(folder + "/part-r*", seed=4242)
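Optionally, you can check that the globs matched some part files before building the record datasets; this is a quick sanity-check sketch, not part of the original flow.
In [ ]:
# Count the part files matched for each dataset (sanity check)
print("Test files: ", len(list(files_test_dataset)))
print("Train files:", len(list(files_train_dataset)))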
In [7]:
test_dataset=tf.data.TFRecordDataset(files_test_dataset)
train_dataset=tf.data.TFRecordDataset(files_train_dataset)
In [8]:
# use for debug
# for record in test_dataset.take(1):
# print(record)
In [9]:
# Function to decode TF records into the required features and labels
def decode(serialized_example):
    deser_features = tf.io.parse_single_example(
        serialized_example,
        # Defaults are not specified since both keys are required.
        features={
            'encoded_label': tf.io.FixedLenFeature((3), tf.float32),
            'HLF_input': tf.io.FixedLenFeature((14), tf.float32),
        })
    return deser_features['HLF_input'], deser_features['encoded_label']
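To make the schema concrete, the sketch below serializes one record with the same layout and round-trips it through decode(). This is illustrative only (the published datasets already ship in this format), and the zero-filled feature vector is a placeholder.
In [ ]:
# Illustrative sketch: build a tf.train.Example with the same schema
# as the dataset and round-trip it through decode()
def serialize_example(hlf_features, encoded_label):
    feature = {
        'HLF_input': tf.train.Feature(
            float_list=tf.train.FloatList(value=hlf_features)),
        'encoded_label': tf.train.Feature(
            float_list=tf.train.FloatList(value=encoded_label)),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

record_bytes = serialize_example([0.0] * 14, [0.0, 1.0, 0.0])  # placeholder values
print(decode(record_bytes))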
In [10]:
parsed_test_dataset=test_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
parsed_train_dataset=train_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
In [11]:
# Show an example of the parsed data
for record in parsed_test_dataset.take(1):
print(record)
In [12]:
# Tunables
shuffle_size = 100000
batch_size = 128
# Cache first so that every epoch reshuffles the cached data, batch after
# the shuffle, and prefetch last to overlap input preparation with training
train = parsed_train_dataset.cache()
train = train.shuffle(shuffle_size)
train = train.batch(batch_size)
train = train.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
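A quick shape check on one batch (a sketch; the expected shapes follow from batch_size = 128, 14 input features and 3 label classes):
In [ ]:
# Inspect one training batch: features (128, 14), labels (128, 3)
for features, labels in train.take(1):
    print(features.shape, labels.shape)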
In [13]:
train
Out[13]:
In [14]:
test_batch_size = 10240
# Chain cache, batch and prefetch into the test pipeline
test = parsed_test_dataset.cache()
test = test.batch(test_batch_size)
test = test.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
In [15]:
# Train the Keras model
%time history = keras_model.fit(train, validation_data=test, epochs=5)
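If you want to persist the trained model, here is a minimal sketch; the file name is an example, and the native .keras format requires a recent TensorFlow version (older versions can use an .h5 suffix instead):
In [ ]:
# Optional: save the trained model (file name is an example)
keras_model.save("hlf_classifier.keras")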
Performance metrics
In [16]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
# Graph with loss vs. epoch
plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.title("HLF classifier loss")
plt.show()
In [17]:
# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.title("HLF classifier accuracy")
plt.show()
Confusion Matrix
In [18]:
y_pred = history.model.predict(test)
In [19]:
# extract the labels from parsed_test_dataset
%time y_true = np.stack([labels.numpy() for features, labels in parsed_test_dataset])
In [20]:
from sklearn.metrics import accuracy_score
print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1))))
In [21]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]
cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)
## Normalize the confusion matrix by row (i.e. by true-class counts)
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
plt.show()
ROC and AUC
In [22]:
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [23]:
# Dictionary containing ROC-AUC for the three classes
roc_auc
Out[23]:
In [24]:
%matplotlib notebook
# Plot the ROC curve for the tt class (index 1 in labels_name)
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
plt.figure()
plt.plot(fpr[1], tpr[1], lw=2,
         label='HLF classifier (AUC = %0.4f)' % roc_auc[1])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title(r'$t\bar{t}$ selector')
plt.legend(loc="lower right")
plt.show()
In [ ]: