TensorFlow_Inclusive_Classifier_GRU_TFRecord.ipynb
Training the Inclusive classifier with tf.keras using data in TFRecord format¶
tf.keras Inclusive classifier, GRU-based model. This notebook trains a neural network for a particle classifier using the Inclusive Classifier model, taking as input the full list of reconstructed particles plus the High Level Features. Data is prepared in TFRecord format by converting from Parquet using Apache Spark. TensorFlow data processing uses tf.data and tf.io.
Credits: this notebook is taken with permission from the work:
- Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)
- Code and data at: https://github.com/cerndb/SparkDLTrigger
The model is a classifier implemented as a Recurrent Neural Network
- input: 14 high-level features and an array of 801 particles with 19 low-level features, described in Topology classification with deep learning to improve real-time event selection at the LHC
- output: 3 classes, "W + jet", "QCD", "t tbar", see also Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)
- Open dataset: download data
Create the Keras model for the inclusive classifier¶
In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Sequential, Input, Model
from tensorflow.keras.layers import Masking, Dense, Activation, GRU, Dropout, concatenate
In [2]:
tf.version.VERSION
Out[2]:
In [3]:
# Check that we have a GPU available
tf.config.list_physical_devices('GPU')
Out[3]:
In [4]:
## GRU branch, processing the (801, 19) particle-level input
gru_input = Input(shape=(801,19), name='gru_input')
a = gru_input
a = Masking(mask_value=0.)(a)  # mask zero-padded particle slots
a = GRU(units=50, activation='tanh')(a)
gruBranch = Dropout(0.2)(a)
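The Masking layer flags zero-padded particle slots so that the downstream GRU skips them. A minimal sanity-check sketch (the toy tensor below is hypothetical, not part of the dataset):
In [ ]:
# Sketch: verify that Masking builds a mask hiding zero-padded rows
toy = tf.constant([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])  # 1 event, 3 slots, 2 features
masked = Masking(mask_value=0.)(toy)
print(masked._keras_mask)  # expected: [[ True  True False]] -> the padded slot is ignored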
In [5]:
## HLF branch, processing the 14 High Level Features
hlf_input = Input(shape=(14,), name='hlf_input')
b = hlf_input
hlfBranch = Dropout(0.2)(b)
In [6]:
## Concatenate the two branches and classify into the 3 output classes
c = concatenate([gruBranch, hlfBranch])
c = Dense(25, activation='relu')(c)
output = Dense(3, activation='softmax')(c)
In [7]:
model = Model(inputs=[gru_input, hlf_input], outputs=output)
In [8]:
## Compile the model
optimizer = 'Adam'
loss = 'categorical_crossentropy'
model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
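Passing the string 'Adam' uses the optimizer's default settings. An explicit instance (Adam is already imported above) makes the learning rate tunable; an equivalent sketch, commented out to keep the compile step above unchanged:
In [ ]:
# Equivalent, with an explicit learning rate (0.001 is Adam's default)
# optimizer = Adam(learning_rate=0.001)
# model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])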
In [9]:
model.summary()
Load test and training data in TFRecord format, using tf.data and tf.io¶
In [10]:
# Download the datasets from
# ** https://github.com/cerndb/SparkDLTrigger/tree/master/Data **
#
# For CERN users, data is already available on EOS
FOLDER = "/eos/project/s/sparkdltrigger/public/"
PATH = FOLDER + "testUndersampled_InclusiveClassifier.tfrecord"
files_test_dataset = tf.data.Dataset.list_files(PATH+"/part-r*", shuffle=False)
# training dataset
PATH = FOLDER + "trainUndersampled_InclusiveClassifier.tfrecord"
files_train_dataset = tf.data.Dataset.list_files(PATH+"/part-r*", seed=4242)
In [11]:
# tunable
num_parallel_reads = 8

test_dataset = files_test_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE).interleave(
    tf.data.TFRecordDataset,
    cycle_length=num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

train_dataset = files_train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE).interleave(
    tf.data.TFRecordDataset,
    cycle_length=num_parallel_reads,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
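For reference, an alternative to the list_files + interleave pattern is TFRecordDataset's built-in parallel reads; a sketch, commented out so the pipeline above stays as-is:
In [ ]:
# Alternative: let TFRecordDataset read the file shards in parallel directly
# test_dataset = tf.data.TFRecordDataset(
#     tf.io.gfile.glob(FOLDER + "testUndersampled_InclusiveClassifier.tfrecord/part-r*"),
#     num_parallel_reads=num_parallel_reads)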
In [12]:
# Function to decode TFRecord data into the required features and labels.
# In particular, GRU_input is stored as a flat array and needs to be reshaped to (801,19)
def decode(serialized_example):
    deser_features = tf.io.parse_single_example(
        serialized_example,
        features={
            'HLF_input': tf.io.FixedLenFeature((14), tf.float32),
            'GRU_input': tf.io.FixedLenFeature((801,19), tf.float32),
            'encoded_label': tf.io.FixedLenFeature((3), tf.float32),
        })
    return ((deser_features['GRU_input'], deser_features['HLF_input']), deser_features['encoded_label'])
In [13]:
# use for debug
# for record in test_dataset.take(1):
# print(record)
In [14]:
parsed_test_dataset = test_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
parsed_train_dataset = train_dataset.map(decode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
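A quick way to confirm that decode produced the expected structure is to inspect the dataset's element_spec:
In [ ]:
# Expected: ((GRU_input (801, 19), HLF_input (14,)), encoded_label (3,))
print(parsed_test_dataset.element_spec)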
In [15]:
# Show an example of the parsed data
# for record in parsed_test_dataset.take(1):
# print(record)
In [16]:
# tunable
batch_size = 128

train = parsed_train_dataset.batch(batch_size)
train = train.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
train
Out[16]:
In [17]:
# tunable
test_batch_size = 10240

test = parsed_test_dataset.batch(test_batch_size)
test = test.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
Train the tf.keras model¶
In [18]:
# train the Keras model
# tunable
num_epochs = 6
# callbacks = [ tf.keras.callbacks.TensorBoard(log_dir='./logs') ]
callbacks = []
%time history = model.fit(train, validation_data=test, epochs=num_epochs, callbacks=callbacks)
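When training for more epochs than the 6 used here, an early-stopping callback can guard against overfitting; a sketch with hypothetical settings:
In [ ]:
# Optional: stop when validation loss stops improving (patience=2 is a hypothetical choice)
# callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,
#                                               restore_best_weights=True)]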
In [ ]:
# Save the model
# tf.keras.models.save_model(model, "./myGRUmodel" + ".tf", save_format='tf')
Performance metrics¶
In [20]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
# Graph with loss vs. epoch
plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.title("HLF classifier loss")
plt.show()
In [21]:
# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.title("HLF classifier accuracy")
plt.show()
Confusion Matrix¶
In [22]:
# model = tf.keras.models.load_model("./myGRUmodel.tf")
In [23]:
%time model.evaluate(test)
Out[23]:
In [24]:
%time y_pred = model.predict(test)
In [25]:
%time y_true = np.stack([labels.numpy() for features, labels in parsed_test_dataset])
In [26]:
from sklearn.metrics import accuracy_score

print('Accuracy of the classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1))))
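Per-class precision and recall add detail beyond the overall accuracy; a sketch using scikit-learn, assuming the qcd/tt/wjets class order used for the confusion matrix below:
In [ ]:
# Optional: per-class metrics (class order assumed to match the confusion matrix below)
from sklearn.metrics import classification_report
print(classification_report(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1),
                            target_names=['qcd', 'tt', 'wjets']))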
In [27]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]
cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)
## Normalize the confusion matrix row-by-row (each row = one true class)
cm = cm / cm.sum(axis=1, keepdims=True)
fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
ROC and AUC¶
In [28]:
from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
In [29]:
# Dictionary containing ROC-AUC for the three classes
roc_auc
Out[29]:
In [30]:
%matplotlib notebook
# Plot roc curve
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
plt.figure()
plt.plot(fpr[1], tpr[1], lw=2,
         label='Inclusive classifier (AUC = %0.4f)' % roc_auc[1])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()
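To compare all three classes on a single figure, the per-class curves computed above can be drawn in a loop (a short sketch):
In [ ]:
# Sketch: overlay the ROC curves of the three classes
plt.figure()
for i, name in enumerate(labels_name):
    plt.plot(fpr[i], tpr[i], lw=2, label='%s (AUC = %0.4f)' % (name, roc_auc[i]))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('ROC curves per class')
plt.legend(loc='lower right')
plt.show()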
In [ ]: