eXtreme Gradient Boosting (XGBoost) is a gradient boosting algorithm designed to improve the accuracy of machine learning models. It is based on decision trees and works by iteratively adding trees to an ensemble, where each new tree is trained to correct the errors made by the trees before it. The algorithm uses gradient-based optimization to minimize the loss function and find the optimal leaf weights for each tree.
To demonstrate the XGBoost algorithm, we will use the code below to generate a set of data with 100 rows and 5 columns of random decimal values between 0 and 1. This will serve as our X.
We will also generate 100 rows of integer values of 1 or 0 as our y.
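A minimal sketch of this data generation, assuming NumPy is used for the random values:

```python
import numpy as np

# 100 rows x 5 columns of random decimal values between 0 and 1 (our X)
X = np.random.rand(100, 5)

# 100 binary labels of 0 or 1 (our y)
y = np.random.randint(2, size=100)
```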
Next, we will create a DMatrix object 'dtrain' using the X_train and y_train arrays as input values.
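A sketch of this step is below; splitting X and y into training and test sets is an assumption, since the article moves from X and y to X_train and y_train without showing the code:

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Assumed split of the generated data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap the training arrays in XGBoost's optimized DMatrix structure
dtrain = xgb.DMatrix(X_train, label=y_train)
```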
Now that we have the dtrain object created, we can set the hyperparameters of the XGBoost model.
max_depth controls the maximum depth, and therefore the complexity, of the decision trees in the model.
eta is the learning rate of the XGBoost model. It controls the step size taken at each boosting round and helps prevent overfitting.
objective is the loss function the XGBoost model optimizes. In this case, the 'binary:logistic' objective is being used, which is appropriate for binary classification problems. If this were a multiclass classification problem, the 'multi:softmax' objective could be used instead.
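Putting those hyperparameters together, a sketch of the parameter dictionary might look like this (the specific values for max_depth and eta are assumptions for illustration):

```python
# Hyperparameters for the XGBoost model; max_depth and eta values are illustrative assumptions
params = {
    'max_depth': 3,                  # complexity (depth) of each decision tree
    'eta': 0.1,                      # learning rate (step size per boosting round)
    'objective': 'binary:logistic',  # loss function for binary classification
}
```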
Before we train our XGBoost model, we’ll need to set 'num_rounds', a parameter that specifies the number of boosting rounds to be built during training. Each boosting round tries to improve the overall model performance by adding a new decision tree that corrects the errors made by the previous trees.
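Since the training call described below runs 10 boosting rounds, setting this is a one-liner:

```python
# Number of boosting rounds (i.e., decision trees) to build during training
num_rounds = 10
```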
With all the parameters set up in the previous steps, we are now ready to train our XGBoost model. Training uses the data in 'dtrain' and the parameter settings specified in 'params', running 10 boosting rounds, which is also the number of decision trees to be built.
The 'train()' method returns the trained model as a 'Booster' object, which can be used to make predictions on new data or to inspect the model's structure and feature importance.
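A minimal sketch of the training call, using the objects defined above:

```python
# Train the model; xgb.train returns a Booster object
model = xgb.train(params, dtrain, num_boost_round=num_rounds)

# The Booster can then score new data, e.g. the held-out test set
dtest = xgb.DMatrix(X_test)
test_preds = model.predict(dtest)  # predicted probabilities for class 1
```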
To visualize the model, the below code will be used:
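The article's exact plotting code isn't shown here, so the snippet below is a sketch using XGBoost's built-in plot_tree helper (plotting the first tree is an assumption; it also requires the graphviz package):

```python
import matplotlib.pyplot as plt

# Plot the first tree in the trained Booster (requires graphviz to be installed)
xgb.plot_tree(model)
plt.show()
```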
In the visualization, each node in a tree represents a decision based on one feature. The splitting point is chosen to minimize the loss function, which XGBoost approximates using the gradients of the loss at each split.
During each boosting iteration, XGBoost fits a new decision tree to the negative gradient of the loss function with respect to the predictions of the model so far. In other words, the new tree is trained to correct the errors made by the previous trees.
New trees are built to capture the remaining errors, or residuals, which represent the difference between the actual and predicted values. Over successive rounds, the ensemble captures more of this remaining error, resulting in better overall performance.
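As a rough illustration of this, we can compare the training residuals after different numbers of boosting rounds (the iteration_range argument is assumed to be available, i.e. XGBoost 1.4 or later):

```python
# Watch the mean absolute residual on the training data shrink as more trees are used
for k in (1, 5, 10):
    preds = model.predict(dtrain, iteration_range=(0, k))
    residuals = y_train - preds
    print(f"rounds={k:2d}  mean |residual| = {abs(residuals).mean():.4f}")
```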
Below is the Medium article where XGBoost was used in fraud detection:
CodeChat
I also hold code talks on Google Meet on the last Friday of every month at 5:00 p.m. The topic for the next chat will be Digesting Decision Trees in Python, along with any questions you may have related to coding. You may sign up here for a meeting reminder, or use the meeting link here if you would like to join directly.
Feedback
The Substackers’ message board is a place where you can share your coding journey with me, so that we can exchange ideas and become better together.
Please open the message board and share your thoughts with me!