<img src="http://oproject.org/img/ROOT.png" height="30%" width="30%">
<img src="http://oproject.org/img/tmvalogo.png" height="30%" width="30%">

# Regression Example

## Declare Factory
Initialize the TMVA library, fetch the data sample from GitHub, and create a factory to run the regression.
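The factory in this section is configured through a colon-separated option string (`!V:!Silent:Color:DrawProgressBar:AnalysisType=Regression`). In that convention a bare name switches a flag on, a leading `!` switches it off, and `Key=Value` sets a value. The sketch below only makes the convention concrete in plain C++: `parseOptions` and `requireOption` are hypothetical helpers, not TMVA functions, and TMVA's own parser may do more (for example, checking option names).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of TMVA): split a colon-separated option
// string into key/value pairs.
//   "Flag"      -> {"Flag", "True"}
//   "!Flag"     -> {"Flag", "False"}
//   "Key=Value" -> {"Key", "Value"}
std::map<std::string, std::string> parseOptions(const std::string& options) {
    std::map<std::string, std::string> parsed;
    std::istringstream stream(options);
    std::string token;
    while (std::getline(stream, token, ':')) {
        if (token.empty()) continue;  // skip empty fields such as "a::b"
        const auto eq = token.find('=');
        if (eq != std::string::npos) {
            parsed[token.substr(0, eq)] = token.substr(eq + 1);
        } else if (token[0] == '!') {
            parsed[token.substr(1)] = "False";
        } else {
            parsed[token] = "True";
        }
    }
    return parsed;
}

// Hypothetical helper: look up an option that is expected to be present.
const std::string& requireOption(const std::map<std::string, std::string>& parsed,
                                 const std::string& key) {
    const auto it = parsed.find(key);
    assert(it != parsed.end());
    return it->second;
}
```

Applied to the factory string, this sketch yields `V` and `Silent` as `False`, `Color` and `DrawProgressBar` as `True`, and `AnalysisType` as `Regression`.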

In [1]:
TMVA::Tools::Instance();

auto inputFile = TFile::Open("https://raw.githubusercontent.com/iml-wg/tmvatutorials/master/inputdata.root");
auto outputFile = TFile::Open("TMVAOutputBDT.root", "RECREATE");

TMVA::Factory factory("TMVARegression", outputFile,
                      "!V:!Silent:Color:DrawProgressBar:AnalysisType=Regression" ); 

## Declare DataLoader
Define the features and the target for the regression.

In [2]:
TMVA::DataLoader loader("dataset"); 

// Add the feature variables; the names refer to branches of the TTree in inputFile
loader.AddVariable("var1");
loader.AddVariable("var2");
loader.AddVariable("var3");
loader.AddVariable("var4");
loader.AddVariable("var5 := var1-var3"); // create new features
loader.AddVariable("var6 := var1+var2");

loader.AddTarget( "target := var2+var3" ); // define the target for the regression
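As a quick check of the expressions above, the following sketch evaluates the derived features and the target for a single event in plain C++. The `Event` struct and the helper functions are hypothetical and only mirror the formulas passed to `AddVariable` and `AddTarget`. It also makes one property of this toy setup explicit: the target equals `var6 - var5`, so the inputs determine it exactly.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for one row of the input tree.
struct Event {
    double var1, var2, var3, var4;
};

// Derived features, mirroring "var5 := var1-var3" and "var6 := var1+var2".
double var5(const Event& e) { return e.var1 - e.var3; }
double var6(const Event& e) { return e.var1 + e.var2; }

// Regression target, mirroring "target := var2+var3".
double target(const Event& e) { return e.var2 + e.var3; }

// Sanity check: var6 - var5 = (var1 + var2) - (var1 - var3) = var2 + var3.
void checkTargetIdentity(const Event& e) {
    assert(std::fabs((var6(e) - var5(e)) - target(e)) < 1e-12);
}
```

For an event with var1 = 1, var2 = 2, and var3 = 0.5, `var5` is 0.5, `var6` is 3, and `target` is 2.5, which is `var6 - var5`.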


## Setup Dataset
Attach the input TTree to the DataLoader and split it into training and test samples.

In [3]:
TTree *tree;
inputFile->GetObject("Sig", tree);

TCut mycuts = ""; // e.g. TCut mycuts = "abs(var1)<0.5";

loader.AddRegressionTree(tree, 1.0);   // link the TTree to the loader with a weight of 1
loader.PrepareTrainingAndTestTree(mycuts,
                                   "nTrain_Regression=1000:nTest_Regression=1000:SplitMode=Random:NormMode=NumEvents:!V" );

DataSetInfo              : [dataset] : Added class "Regression"
                         : Add Tree Sig of type Regression with 6000 events
                         : Dataset[dataset] : Class index : 0  name : Regression
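With `SplitMode=Random`, `nTrain_Regression=1000`, and `nTest_Regression=1000`, the loader takes 1000 training and 1000 testing events from the 6000 in the tree. A minimal sketch of such a split in plain C++, assuming a simple shuffle-and-slice approach; `randomSplit` and its `seed` parameter are hypothetical, and TMVA's actual splitting code may differ in detail:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Illustrative only: shuffle the event indices, then take two disjoint slices.
// Events beyond nTrain + nTest are left unused in this sketch.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
randomSplit(std::size_t nEvents, std::size_t nTrain, std::size_t nTest, unsigned seed) {
    assert(nTrain + nTest <= nEvents);

    std::vector<std::size_t> indices(nEvents);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    const auto trainEnd = indices.begin() + static_cast<std::ptrdiff_t>(nTrain);
    const auto testEnd = trainEnd + static_cast<std::ptrdiff_t>(nTest);
    std::vector<std::size_t> train(indices.begin(), trainEnd);
    std::vector<std::size_t> test(trainEnd, testEnd);
    return {std::move(train), std::move(test)};
}
```

Calling `randomSplit(6000, 1000, 1000, seed)` returns two index lists of 1000 entries each with no event in common.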


## Book the Regression Method

Book the regression method. Here we choose a boosted decision tree (BDT) trained with gradient boosting, hence the method name BDTG and the option BoostType=Grad. The RegressionLossFunctionBDTG option used below applies to this boosting type.

Set the hyperparameters: the number of trees (NTrees), the boosting type (BoostType), the learning rate (Shrinkage), and the maximum tree depth (MaxDepth). Also choose the loss function: 'AbsoluteDeviation', 'Huber', or 'LeastSquares'. nCuts sets the number of grid points scanned across each feature's range when searching for a split; larger values take more time but may give more accurate results.
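To make the three loss choices concrete, here is a sketch of their textbook definitions as functions of the residual r = prediction - target, in plain C++. The function names and the `delta` threshold of the Huber loss are assumptions made for illustration; TMVA picks its own threshold internally and its normalisation may differ.

```cpp
#include <cassert>
#include <cmath>

// Least squares: penalises large residuals quadratically.
double leastSquaresLoss(double r) { return 0.5 * r * r; }

// Absolute deviation: grows linearly, so outliers weigh less.
double absoluteDeviationLoss(double r) { return std::fabs(r); }

// Huber: quadratic for |r| <= delta, linear beyond that point.
// delta is a hypothetical parameter of this sketch.
double huberLoss(double r, double delta) {
    assert(delta > 0.0);
    const double a = std::fabs(r);
    return a <= delta ? 0.5 * r * r : delta * (a - 0.5 * delta);
}
```

With `delta = 1`, a residual of 0.5 costs 0.125 under both the least-squares and Huber definitions, while a residual of 3 costs 4.5 under least squares, 3 under absolute deviation, and 2.5 under Huber.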

In [4]:
// Boosted Decision Trees 
factory.BookMethod(&loader,TMVA::Types::kBDT, "BDTG",
                   TString("!H:!V:NTrees=64:BoostType=Grad:Shrinkage=0.3:nCuts=20:MaxDepth=4:")+
                   TString("RegressionLossFunctionBDTG=AbsoluteDeviation"));

Factory                  : Booking method: BDTG
                         : 
                         : the option *InverseBoostNegWeights* does not exist for BoostType=Grad --> change
                         : to new default for GradBoost *Pray*
DataSetFactory           : [dataset] : Number of events in input trees
                         : 
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Regression -- training events            : 1000
                         : Regression -- testing events             : 1000
                         : Regression -- training and testing events: 2000
                         : 
DataSetInfo              : Correlation matrix (Regression):
                         : --------------------------------------------------------------
                         :               var1    var2    var3    var4 var1-
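The effect of `nCuts` can be pictured as a grid search over candidate thresholds for one feature. The sketch below, in plain C++, assumes evenly spaced thresholds inside the feature's range and a squared-error criterion; `bestCut` and `sse` are hypothetical helpers, not TMVA code, and the real split search may use a different criterion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sum of squared deviations from the mean, computed from running sums.
static double sse(double sum, double sumSq, std::size_t n) {
    return n == 0 ? 0.0 : sumSq - sum * sum / static_cast<double>(n);
}

// Hypothetical helper: try nCuts evenly spaced thresholds inside [min(x), max(x)]
// and return the one that minimises the squared error of y in the two halves.
double bestCut(const std::vector<double>& x, const std::vector<double>& y, int nCuts) {
    assert(!x.empty() && x.size() == y.size() && nCuts > 0);
    const auto range = std::minmax_element(x.begin(), x.end());
    const double lo = *range.first;
    const double hi = *range.second;

    double bestThreshold = lo;
    double bestError = std::numeric_limits<double>::infinity();
    for (int i = 1; i <= nCuts; ++i) {
        const double threshold = lo + (hi - lo) * i / (nCuts + 1);
        double sumL = 0.0, sqL = 0.0, sumR = 0.0, sqR = 0.0;
        std::size_t nL = 0, nR = 0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            if (x[j] < threshold) { sumL += y[j]; sqL += y[j] * y[j]; ++nL; }
            else                  { sumR += y[j]; sqR += y[j] * y[j]; ++nR; }
        }
        const double error = sse(sumL, sqL, nL) + sse(sumR, sqR, nR);
        if (error < bestError) { bestError = error; bestThreshold = threshold; }
    }
    return bestThreshold;
}
```

For x = {0, 1, 2, 3, 4} and y = {0, 0, 0, 10, 10}, a single grid point can only try the threshold 2, whereas three grid points also try 3, which separates the two groups of y values perfectly. This is the sense in which a finer grid costs more time but can find better splits.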

## Train Method

In [5]:
factory.TrainAllMethods();

Factory                  : Train all methods
Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'var1' <---> Output : variable 'var1'
                         : Input : variable 'var2' <---> Output : variable 'var2'
                         : Input : variable 'var3' <---> Output : variable 'var3'
                         : Input : variable 'var4' <---> Output : variable 'var4'
                         : Input : variable 'var5' <---> Output : variable 'var5'
                         : Input : variable 'var6' <---> Output : variable 'var6'
TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var1:    0.23134    0.98776   [    -3.3494     3.0772 ]
      

                         : Elapsed time for training with 1000 events: 0.119 sec
                         : Dataset[dataset] : Create results for training
                         : Dataset[dataset] : Evaluation of BDTG on training sample
                         : Dataset[dataset] : Elapsed time for evaluation of 1000 events: 0.0683 sec
                         : Create variable histograms
                         : Create regression target histograms
                         : Create regression average deviation
                         : Results created
                         : Creating xml weight file: dataset/weights/TMVARegression_BDTG.weights.xml


Factory                  : Training finished
                         : 
Factory                  : === Destroy and recreate all methods via weight files for testing ===
                         : 


## Test and Evaluate the Model

In [6]:
factory.TestAllMethods();
factory.EvaluateAllMethods();    

Factory                  : Test all methods
Factory                  : Test method: BDTG for Regression performance
                         : 
                         : Dataset[dataset] : Create results for testing
                         : Dataset[dataset] : Evaluation of BDTG on testing sample


                         : Dataset[dataset] : Elapsed time for evaluation of 1000 events: 0.109 sec
                         : Create variable histograms
                         : Create regression target histograms
                         : Create regression average deviation
                         : Results created
Factory                  : Evaluate all methods
                         : Evaluate regression method: BDTG
TFHandler_BDTG           : Variable        Mean        RMS   [        Min        Max ]
                         : -----------------------------------------------------------
                         :     var1:    0.18427     1.0189   [    -3.3780     3.2875 ]
                         :     var2:    0.28570    0.98438   [    -3.2880     3.4734 ]
                         :     var3:    0.41410    0.99893   [    -2.6232     4.6422 ]
                         :     var4:    0.79156     1.0958   [    -2.9492     4.0073 ]
                     

## Gather and Plot the Results
Let's plot the residuals of the BDTG predictions. First, close the output file so that its contents are written to disk and it can be reopened for reading. Then get the results for the test set from the TestTree. Finally, plot the residuals, i.e. the predicted value minus the true value.

In [7]:
%jsroot on
outputFile->Close();
auto resultsFile = TFile::Open("TMVAOutputBDT.root");
auto resultsTree = resultsFile->Get("dataset/TestTree"); 
TCanvas c;
resultsTree->Draw("BDTG-target"); // BDTG is the predicted value, target is the true value
c.Draw();
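The expression `BDTG-target` drawn above is the per-event residual. To summarise the residuals with numbers as well as a histogram, one could compute their mean (the bias) and their root mean square. The sketch below does this in plain C++ on two vectors; `ResidualSummary` and `summarizeResiduals` are hypothetical names, and filling the vectors from the `TestTree` is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ResidualSummary {
    double mean;  // average of (predicted - truth), i.e. the bias
    double rms;   // square root of the mean squared residual (about zero)
};

// Hypothetical helper: summarise residuals of predictions against true values.
ResidualSummary summarizeResiduals(const std::vector<double>& predicted,
                                   const std::vector<double>& truth) {
    assert(!predicted.empty() && predicted.size() == truth.size());
    double sum = 0.0;
    double sumSq = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double r = predicted[i] - truth[i];
        sum += r;
        sumSq += r * r;
    }
    const double n = static_cast<double>(predicted.size());
    return {sum / n, std::sqrt(sumSq / n)};
}
```

A mean close to zero indicates little systematic over- or under-prediction, while the root mean square reflects the typical size of the error.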