XGBoost_with_Pandas_Parquet.ipynb

Training of the High Level Feature classifier using XGBoost on GPU¶

This notebook trains a particle classifier on High Level Features (HLF) using XGBoost. Pandas is used to read the data and pass it to XGBoost.
Credits: this notebook is taken with permission from the work:

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)
Code and data at: https://github.com/cerndb/SparkDLTrigger
The model is a classifier implemented with XGBoost (gradient-boosted decision trees)
- Model input: 14 "high level features", described in Topology classification with deep learning to improve real-time event selection at the LHC
- Model output: 3 classes, "W + jet", "QCD", "$t\bar{t}$"

Load train and test datasets via Pandas¶

In [1]:

# Download the datasets from 
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data
#
# For CERN users, data is already available on EOS
PATH = "/eos/project/s/sparkdltrigger/public/"

import pandas as pd

testPDF = pd.read_parquet(path= PATH + 'testUndersampled_HLF_features.parquet', 
                          columns=['HLF_input', 'encoded_label'])

trainPDF = pd.read_parquet(path= PATH + 'trainUndersampled_HLF_features.parquet', 
                           columns=['HLF_input', 'encoded_label'])

In [2]:

# Check the number of events in the train and test datasets

# len() gives the number of rows; DataFrame.count() would instead
# return a Series of per-column non-null counts
num_test = len(testPDF)
num_train = len(trainPDF)

print('There are {} events in the test dataset'.format(num_test))
print('There are {} events in the train dataset'.format(num_train))

There are 856090 events in the test dataset
There are 3426083 events in the train dataset

In [3]:

# Show the schema and a data sample of the test dataset
testPDF

Out[3]:

	HLF_input	encoded_label
0	[0.015150733133517018, 0.003511028294205839, 0...	[1.0, 0.0, 0.0]
1	[0.0, 0.003881822832783805, 0.7166341448458555...	[1.0, 0.0, 0.0]
2	[0.009639073600865505, 0.0010022659022912096, ...	[1.0, 0.0, 0.0]
3	[0.016354407625436572, 0.002108937905084598, 0...	[1.0, 0.0, 0.0]
4	[0.01925979125354152, 0.004603697276827594, 0....	[1.0, 0.0, 0.0]
...	...	...
856085	[0.020383967386165446, 0.0022348975484913444, ...	[0.0, 1.0, 0.0]
856086	[0.02475209699743233, 0.00867502196073073, 0.3...	[0.0, 1.0, 0.0]
856087	[0.03498179428310887, 0.02506331737284528, 0.9...	[0.0, 1.0, 0.0]
856088	[0.03735147362869153, 0.003645269183639405, 0....	[0.0, 1.0, 0.0]
856089	[0.04907273976147946, 0.003058462073646085, 0....	[0.0, 1.0, 0.0]

856090 rows × 2 columns

Convert training and test datasets from Pandas DataFrames to Numpy arrays¶

Now we convert the Pandas DataFrames into NumPy arrays so that they can be passed to XGBoost.

In [4]:

import numpy as np

X = np.stack(trainPDF["HLF_input"])
y = np.stack(trainPDF["encoded_label"])

X_test = np.stack(testPDF["HLF_input"])
y_test = np.stack(testPDF["encoded_label"])
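
To illustrate what `np.stack` does in the cell above, here is a minimal sketch on a toy DataFrame. The values and the four-feature width are made up for illustration; the real `HLF_input` column holds 14 features per event. Each cell stores a sequence, and stacking yields a regular 2-D array of shape (events, features).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Parquet data: each cell holds a list of numbers.
# (Illustrative values only; not taken from the real dataset.)
toy_pdf = pd.DataFrame({
    "HLF_input": [[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]],
    "encoded_label": [[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]],
})

# np.stack turns the column of per-event sequences into one 2-D array
X_toy = np.stack(toy_pdf["HLF_input"])
y_toy = np.stack(toy_pdf["encoded_label"])

print(X_toy.shape)  # (3, 4): 3 events, 4 features
print(y_toy.shape)  # (3, 3): 3 events, 3 one-hot label slots
```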

XGBoost¶

In [6]:

import xgboost as xgb
from xgboost import XGBClassifier

xgb.__version__

Out[6]:

'2.0.3'

In [7]:

# Create model instance
# Use XGBoost on GPU resources
#bst = XGBClassifier(tree_method='gpu_hist', n_estimators=3, max_depth=2, learning_rate=1, objective='multi:softprob')

bst = XGBClassifier(device="cuda")

In [8]:

# Train the model on the training dataset
%time bst.fit(X, y)

CPU times: user 13 s, sys: 4.59 s, total: 17.6 s
Wall time: 12 s

Out[8]:

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device='cuda', early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)


Evaluate the Classifier - Performance metrics¶

In [9]:

# Make predictions on the test dataset
y_pred = bst.predict(X_test)

/cvmfs/sft-nightlies.cern.ch/lcg/views/dev3cuda/Mon/x86_64-centos7-gcc11-opt/lib/python3.9/site-packages/xgboost/core.py:160: UserWarning: [10:20:43] WARNING: /build/jenkins/workspace/lcg_nightly_pipeline/build/pyexternals/xgboost-2.0.3/src/xgboost/2.0.3/src/common/error_msg.cc:58: Falling back to prediction using DMatrix due to mismatched devices. This might lead to higher memory usage and slower performance. XGBoost is running on: cuda:0, while the input data is on: cpu.
Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.

This warning will only be shown once.

  warnings.warn(smsg, UserWarning)

In [10]:

from sklearn.metrics import accuracy_score

print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_test, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the HLF classifier: 0.9172
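
The accuracy computation above relies on `np.argmax` to turn one-hot rows into class indices. A small sketch on made-up arrays (not real predictions) shows the same steps without scikit-learn:

```python
import numpy as np

# Made-up one-hot truth and predictions for four events
y_true_toy = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]])
y_pred_toy = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 1, 0],   # misclassified event
                       [0, 1, 0]])

# argmax along axis 1 gives the index of the hot entry in each row
true_idx = np.argmax(y_true_toy, axis=1)
pred_idx = np.argmax(y_pred_toy, axis=1)

# Accuracy is the fraction of events whose class indices agree
accuracy_toy = np.mean(true_idx == pred_idx)
print(accuracy_toy)
```

Note that `np.argmax` returns 0 for an all-zero row, so a prediction row in which no class is set would be counted as class 0.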

In [11]:

%matplotlib notebook

import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_pred, axis=1), np.argmax(y_test, axis=1), labels=labels)

## Normalize CM: divide each row (predicted class) by its own row sum
cm = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
plt.xlabel('True labels')
plt.ylabel('Predicted labels')
plt.show()
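
When a confusion matrix is normalized by rows, the row sums have to broadcast along the columns. The sketch below, on a made-up 3×3 count matrix, uses `keepdims=True` for that. With a 1-D vector of row sums, NumPy would instead divide column j by the sum of row j.

```python
import numpy as np

# Made-up counts: rows and columns are class indices 0..2
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 5, 15]])

# keepdims=True keeps the sums as a (3, 1) column,
# so each row is divided by its own total
row_sums = cm_toy.sum(axis=1, keepdims=True)
cm_row_norm = cm_toy / row_sums

# For comparison: a 1-D (3,) vector is aligned with the columns instead
cm_col_divided = cm_toy / cm_toy.sum(axis=1)

print(cm_row_norm.sum(axis=1))  # each row sums to 1
```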


ROC and AUC¶

In [14]:

from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

In [15]:

# Dictionary containing the ROC AUC for the three classes
roc_auc

Out[15]:

{0: 0.944949717170571, 1: 0.9418716018584723, 2: 0.9302782133005137}
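
ROC curves are usually computed from continuous scores rather than from thresholded 0/1 predictions. `XGBClassifier.predict` returns class labels, while `predict_proba` returns probabilities; whether the shape of its output matches `y_test` for one-hot targets is an assumption to check against the XGBoost documentation. The sketch below uses made-up binary labels and scores (not results from this model) to show how the two inputs can give different AUC values with the same scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up binary truth and continuous scores for six events
y_true_toy = np.array([0, 0, 0, 1, 1, 1])
scores_toy = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])

# Hard 0/1 predictions obtained by thresholding the scores at 0.5
hard_toy = (scores_toy > 0.5).astype(int)

# ROC/AUC from continuous scores: each distinct score acts as a threshold
fpr_s, tpr_s, _ = roc_curve(y_true_toy, scores_toy)
auc_scores = auc(fpr_s, tpr_s)

# ROC/AUC from hard predictions: only one non-trivial operating point
fpr_h, tpr_h, _ = roc_curve(y_true_toy, hard_toy)
auc_hard = auc(fpr_h, tpr_h)

print(auc_scores, auc_hard)
```

In this toy case the score-based AUC is higher, because thresholding discards the ranking information within each predicted class.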

In [16]:

%matplotlib notebook

# Plot roc curve 
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')

plt.figure()
plt.plot(fpr[0], tpr[0], lw=2,
         label='HLF classifier (AUC) = %0.4f' % roc_auc[0])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()
