XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
In this post you will discover XGBoost and get a gentle introduction to what it is, where it came from and how you can learn more.
After reading this post you will know:
- What XGBoost is and the goals of the project.
- Why XGBoost must be a part of your machine learning toolkit.
- Where you can learn more to start using XGBoost on your next machine learning project.
Kick-start your project with my new book XGBoost With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Feb/2021: Fixed broken links.
What is XGBoost?
XGBoost stands for eXtreme Gradient Boosting.
The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.
— Tianqi Chen, in answer to the question “What is the difference between the R gbm (gradient boosting machine) and xgboost (extreme gradient boosting)?” on Quora
It is an implementation of gradient boosting machines created by Tianqi Chen, now with contributions from many developers. It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community (DMLC), who are also the creators of the popular MXNet deep learning library.
Tianqi Chen provides a brief and interesting back story on the creation of XGBoost in the post Story and Lessons Behind the Evolution of XGBoost.
XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:
- Command Line Interface (CLI).
- C++ (the language in which the library is written).
- Python interface as well as a model in scikit-learn.
- R interface as well as a model in the caret package.
- Java and JVM languages like Scala and platforms like Hadoop.
The library is laser focused on computational speed and model performance, so there are few frills. Nevertheless, it does offer a number of advanced features.
The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported:
- Gradient Boosting, also called the gradient boosting machine, including the learning rate.
- Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
- Regularized Gradient Boosting with both L1 and L2 regularization.
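To make the three forms above more concrete, here is a minimal sketch of how they might map onto XGBoost parameters. The parameter names follow my understanding of the scikit-learn style wrapper, and the values are purely illustrative assumptions, not tuned recommendations.

```python
# Illustrative parameter set: the values are assumptions chosen for the example.
params = {
    "objective": "reg:squarederror",
    # Gradient boosting: shrink the contribution of each new tree.
    "learning_rate": 0.1,       # also known as eta
    # Stochastic gradient boosting: sub-sample rows and columns.
    "subsample": 0.8,           # fraction of rows used per tree
    "colsample_bytree": 0.8,    # fraction of columns used per tree
    "colsample_bylevel": 1.0,   # fraction of columns used per depth level
    "colsample_bynode": 1.0,    # fraction of columns used per split
    # Regularized gradient boosting: penalize large leaf weights.
    "reg_alpha": 0.0,           # L1 penalty
    "reg_lambda": 1.0,          # L2 penalty
}

# All sampling fractions and the learning rate should lie in (0, 1].
fraction_keys = [
    "learning_rate",
    "subsample",
    "colsample_bytree",
    "colsample_bylevel",
    "colsample_bynode",
]
fractions_ok = all(0.0 < params[key] <= 1.0 for key in fraction_keys)
print(fractions_ok)
```

If XGBoost is installed, a dictionary like this could be unpacked into the scikit-learn style estimator, for example xgboost.XGBRegressor(**params). Check the official parameter documentation for the defaults in your version.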
The library provides a system for use in a range of computing environments, not least:
- Parallelization of tree construction using all of your CPU cores during training.
- Distributed Computing for training very large models using a cluster of machines.
- Out-of-Core Computing for very large datasets that don’t fit into memory.
- Cache Optimization of data structures and algorithms to make best use of hardware.
The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:
- Sparse Aware implementation with automatic handling of missing data values.
- Block Structure to support the parallelization of tree construction.
- Continued Training so that you can further boost an already fitted model on new data.
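The automatic handling of missing values can be pictured as learning a default direction for each split: rows with a missing value are sent to whichever side of the split gives the better fit. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions: it scores splits with a squared-error criterion on a single feature, whereas XGBoost itself works with gradient statistics, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_with_default(xs, ys):
    """Return (threshold, default_direction, error) for one feature.

    Missing feature values are represented as None. Assumes at least
    two distinct non-missing values are present.
    """
    present = [(x, y) for x, y in zip(xs, ys) if x is not None]
    missing = [y for x, y in zip(xs, ys) if x is None]
    best = None
    # Skip the largest value: splitting there leaves the right side empty.
    for threshold in sorted({x for x, _ in present})[:-1]:
        left = [y for x, y in present if x <= threshold]
        right = [y for x, y in present if x > threshold]
        # Try sending the missing rows to each side and keep the better fit.
        for direction in ("left", "right"):
            if direction == "left":
                error = sse(left + missing) + sse(right)
            else:
                error = sse(left) + sse(right + missing)
            if best is None or error < best[2]:
                best = (threshold, direction, error)
    return best


xs = [1.0, 2.0, None, 3.0, 4.0, None]
ys = [1.0, 1.1, 3.0, 3.0, 3.2, 3.1]
threshold, direction, error = best_split_with_default(xs, ys)
print(threshold, direction)
```

In this toy data the rows with missing values have targets close to the high-valued group, so the split learns to send missing values to the right.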
XGBoost is free, open source software available for use under the permissive Apache 2.0 license.
Why Use XGBoost?
The two reasons to use XGBoost are also the two goals of the project:
- Execution Speed.
- Model Performance.
1. XGBoost Execution Speed
Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.
Szilard Pafka performed some objective benchmarks comparing the performance of XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post titled “Benchmarking Random Forest Implementations”.
He also provides all the code on GitHub and a more extensive report of results with hard numbers.
His results showed that XGBoost was almost always faster than the other benchmarked implementations from R, Python, Spark and H2O.
From his experiment, he commented:
I also tried xgboost, a popular library for boosting which is capable to build random forests as well. It is fast, memory efficient and of high accuracy
— Szilard Pafka, Benchmarking Random Forest Implementations.
2. XGBoost Model Performance
XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.
The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.
For example, there is an incomplete list of first, second and third place competition winners that used XGBoost, titled: XGBoost: Machine Learning Challenge Winning Solutions.
To make this point more tangible, below are some insightful quotes from Kaggle competition winners:
As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.
— Dato Winners’ Interview: 1st place, Mad Professors
When in doubt, use xgboost.
— Avito Winner’s Interview: 1st place, Owen Zhang
I love single models that do well, and my best single model was an XGBoost that could get the 10th place by itself.
— Caterpillar Winners’ Interview: 1st place
I only used XGBoost.
— Liberty Mutual Property Inspection, Winner’s Interview: 1st place, Qingchen Wang
The only supervised learning method I used was gradient boosting, as implemented in the excellent xgboost package.
— Recruit Coupon Purchase Winner’s Interview: 2nd place, Halla Yang
What Algorithm Does XGBoost Use?
The XGBoost library implements the gradient boosting decision tree algorithm.
This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.
Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.
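The AdaBoost idea of weighting hard-to-predict data points can be shown with a single round of its weight update. This is a minimal sketch of the classic binary AdaBoost update, not anything taken from XGBoost; the function name is hypothetical and it assumes the weighted error is strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, correct):
    """Run one AdaBoost weight update.

    weights: current weight of each training example.
    correct: True where the latest weak learner predicted correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    # The learner's vote grows as its weighted error shrinks.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase the weight of mistakes, decrease the weight of correct examples.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
alpha, new_weights = adaboost_reweight(weights, correct)
print(alpha, new_weights)
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.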
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
This approach supports both regression and classification predictive modeling problems.
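The residual-fitting loop described above can be sketched from scratch in a few lines. This is an illustration of the general gradient boosting idea under simplifying assumptions (one feature, squared-error loss, one-split trees as the weak learners), not XGBoost's implementation; all function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) on a single feature.

    Assumes xs holds at least two distinct values.
    """
    best = None
    # Skip the largest value: splitting there leaves the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak learners."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new model's shrunken output to the running prediction.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

baseline = fit_boosted_stumps(xs, ys, n_rounds=0)  # mean-only model
boosted = fit_boosted_stumps(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(boosted, xs, ys))
```

Each round fits a new stump to what the current ensemble still gets wrong, and the learning rate controls how much of that correction is added. The training error of the boosted model should come out well below that of the mean-only baseline.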
For more on boosting and gradient boosting, see Trevor Hastie’s talk on Gradient Boosting Machine Learning.
Official XGBoost Resources
The best source of information on XGBoost is the official GitHub repository for the project.
From there you can get access to the Issue Tracker and the User Group that can be used for asking questions and reporting bugs.
A great source of links with example code and help is the Awesome XGBoost page.
There is also an official documentation page that includes a getting started guide for a range of different languages, tutorials, how-to guides and more.
There are some more formal papers on XGBoost that are worth a read for more background on the library:
Talks on XGBoost
When getting started with a new tool like XGBoost, it can be helpful to review a few talks on the topic before diving into the code.
XGBoost: A Scalable Tree Boosting System
Tianqi Chen, the creator of the library, gave a talk to the LA Data Science group in June 2016 titled “XGBoost: A Scalable Tree Boosting System”.
There is more information on the DataScience LA blog.
XGBoost: eXtreme Gradient Boosting
Tong He, a contributor to XGBoost for the R interface, gave a talk at the NYC Data Science Academy in December 2015 titled “XGBoost: eXtreme Gradient Boosting”.
There is more information about this talk on the NYC Data Science Academy blog.
Installing XGBoost
There is a comprehensive installation guide on the XGBoost documentation website.
It covers installation for Linux, Mac OS X and Windows.
It also covers installation on platforms such as R and Python.
XGBoost in R
If you are an R user, the best place to get started is the CRAN page for the xgboost package.
From this page you can access the R vignette Package ‘xgboost’ [pdf].
There are also some excellent R tutorials linked from this page to get you started:
There is also the official XGBoost R Tutorial and Understand your dataset with XGBoost tutorial.
XGBoost in Python
Installation instructions are available on the Python section of the XGBoost installation guide.
The official Python Package Introduction is the best place to start when working with XGBoost in Python.
To get started quickly, you can type:
There is also an excellent list of sample source code in Python on the XGBoost Python Feature Walkthrough.
Summary
In this post you discovered the XGBoost algorithm for applied machine learning. You learned:
- That XGBoost is a library for developing fast, high-performance gradient boosted tree models.
- That XGBoost is achieving the best performance on a range of difficult machine learning tasks.
- That you can use this library from the command line, Python and R and how to get started.
Have you used XGBoost? Share your experiences in the comments below.
Do you have any questions about XGBoost or about this post? Ask your questions in the comments below and I will do my best to answer them.