However, we can’t optimize this objective function with traditional methods in Euclidean space, because the model’s parameters are functions (trees) rather than numeric vectors.
And what about the regularization term?
Remember our definition of a tree: \[
f_{t}(x)=w_{q(x)},\quad w\in \mathbb{R}^{T}, q:\mathbb{R}^{d}\rightarrow \{1,2,\dots ,T\}
\]
Here \(w\) is the vector of scores on the leaves, \(q\) is a function that assigns each data point to its corresponding leaf, and \(T\) is the number of leaves.
We define the complexity as \[
\omega(f)=\gamma T+ \frac{1}{2}\lambda \sum^{T}_{j=1}w^{2}_{j}
\]
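To make the \((q, w)\) notation and the complexity penalty concrete, here is a minimal sketch of a made-up three-leaf tree; the split thresholds, leaf scores, and the \(\gamma = \lambda = 1\) defaults are illustrative choices, not values from the paper.

```python
# A minimal sketch (not the paper's code): a toy tree represented as the pair
# (q, w) from the definition above, plus its complexity omega(f).
import numpy as np

w = np.array([0.4, -0.1, 0.7])   # leaf scores, w in R^T with T = 3 (made-up values)

def q(x):
    """Toy leaf-assignment function: maps a data point to a leaf index 1..T."""
    if x[0] < 0.5:
        return 1
    return 2 if x[1] < 1.0 else 3

def f_t(x):
    """Tree prediction: the score of the leaf that x falls into."""
    return w[q(x) - 1]

def omega(w, gamma=1.0, lam=1.0):
    """Complexity penalty: gamma * T + (lambda / 2) * sum_j w_j^2."""
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

print(f_t(np.array([0.3, 2.0])))   # 0.4
print(omega(w))                    # 3 + 0.5 * (0.16 + 0.01 + 0.49) = 3.33
```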
Learning the tree structure
Use the Exact Greedy Algorithm:
Start with a single root node that contains all training examples
Iterate over all features and all values per feature, evaluating each possible split gain
Output the split with the best score (a sketch of this scan follows the Gain section below)
The gain of the best split must be positive; otherwise we stop growing the branch
Gain
We try to split a leaf into two leaves, and the score it gains is \[
Gain=\frac{1}{2}\left[ \frac{G^{2}_{L}}{H_{L}+\lambda}+ \frac{G^{2}_{R}}{H_{R}+\lambda}-\frac{(G_{L}+G_{R})^{2}}{H_{L}+H_{R}+\lambda} \right]-\gamma
\]
The score on the new left leaf
The score on the new right leaf
The score if we do not split
The complexity cost of adding an additional leaf
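As a concrete illustration, here is a minimal sketch of this scan along a single feature, assuming squared-error loss so the per-example gradient statistics \(g_i\) and \(h_i\) are trivial; the exact greedy algorithm simply repeats this over every feature and keeps the overall best split. It is a toy reimplementation of the formula above, not XGBoost’s actual code.

```python
# A minimal sketch of exact greedy split finding on one node and one feature,
# assuming the per-example gradients g and Hessians h are already computed.
import numpy as np

def leaf_score(G, H, lam):
    """Structure score of a leaf with gradient sum G and Hessian sum H."""
    return G ** 2 / (H + lam)

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan all split points of one feature, return (best_gain, threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best = (-np.inf, None)
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]
        if x[i] == x[i + 1]:           # identical values cannot be separated
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (leaf_score(G_L, H_L, lam)
                      + leaf_score(G_R, H_R, lam)
                      - leaf_score(G, H, lam)) - gamma
        if gain > best[0]:
            best = (gain, (x[i] + x[i + 1]) / 2)
    return best

# Toy usage: squared-error loss gives g = prediction - target and h = 1,
# starting from a constant prediction of 0.5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
g, h = (0.5 - y), np.ones_like(y)
print(best_split(x, g, h))            # best split lands between 2.0 and 3.0
```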
Further prevention of overfitting
XGBoost also uses two additional methods to prevent overfitting:
Shrinkage: scales newly added weights by a factor \(\eta\) after each step of tree boosting.
Column (feature) sub-sampling: randomly select a subset of features for each tree (or split), so not all features are available when finding the best split. This also speeds up computations of the parallel algorithm. (Both techniques appear in the configuration sketch below.)
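Both techniques are exposed as tuning parameters in the xgboost library; the sketch below shows how they are typically set. The data is random and the numeric values are arbitrary examples, not recommended settings.

```python
# A sketch of configuring shrinkage and column sub-sampling with the xgboost
# Python package; values are arbitrary examples on random data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 10)), rng.normal(size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "eta": 0.1,               # shrinkage: scale each new tree's weights by 0.1
    "colsample_bytree": 0.8,  # column sub-sampling: 80% of features per tree
    "gamma": 1.0,             # complexity cost per additional leaf
    "lambda": 1.0,            # L2 penalty on leaf weights
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```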
Making it scalable
The exact greedy algorithm cannot be run efficiently when the data doesn’t fit entirely in memory, or in a distributed setting.
Solution: the approximate algorithm.
It finds split points for continuous features by using histograms to approximate quantiles of the data distribution
Don’t consider every feature value, just the bucket boundaries as potential splits
Uses a weighted quantile sketch to determine the bin boundaries (a simplified, unweighted version is sketched below)
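As a rough sketch of the idea, the snippet below proposes candidate thresholds from plain (unweighted) quantiles; the paper’s weighted quantile sketch additionally weights each example by its Hessian \(h_i\) and uses a mergeable streaming data structure, both of which are omitted here.

```python
# A rough sketch of proposing candidate split points from quantiles.
# Plain np.quantile stands in for the paper's weighted quantile sketch.
import numpy as np

def candidate_splits(x, num_bins=32):
    """Return approximate quantile boundaries of one feature as candidate splits."""
    probs = np.linspace(0, 1, num_bins + 1)[1:-1]
    return np.unique(np.quantile(x, probs))

x = np.random.default_rng(0).lognormal(size=10_000)
print(candidate_splits(x, num_bins=8))   # 7 thresholds instead of ~10,000 values
```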
Sparsity-aware Split Finding
Handle sparse data by adding a default direction to each tree node (see the sketch below)
This also exploits the sparsity to make the computational complexity linear in the number of non-missing entries in the input
Data is stored in compressed column (CSC) format, with each column sorted by the corresponding feature value
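A minimal sketch of the default-direction idea: evaluate a candidate split twice, once sending the gradient statistics of the missing-value examples to the left child and once to the right, and keep the better direction. The statistics below are made-up numbers and the helper names are my own, not the paper’s.

```python
# A minimal sketch of choosing a default direction for missing values:
# add the missing examples' gradient statistics to the left child, then to the
# right child, and keep whichever direction gives the larger gain.
def score(G, H, lam=1.0):
    return G ** 2 / (H + lam)

def gain_with_default(GL, HL, GR, HR, G_miss, H_miss, lam=1.0, gamma=0.0):
    G, H = GL + GR + G_miss, HL + HR + H_miss
    parent = score(G, H, lam)
    gain_left = 0.5 * (score(GL + G_miss, HL + H_miss, lam)
                       + score(GR, HR, lam) - parent) - gamma
    gain_right = 0.5 * (score(GL, HL, lam)
                        + score(GR + G_miss, HR + H_miss, lam) - parent) - gamma
    if gain_left >= gain_right:
        return gain_left, "left"
    return gain_right, "right"

# Toy statistics for the non-missing left/right partitions and the missing rows.
print(gain_with_default(GL=2.0, HL=3.0, GR=-1.5, HR=4.0, G_miss=0.5, H_miss=2.0))
```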
Cache-aware access
The block structure improves the computational complexity of split finding, but the new algorithm requires non-contiguous memory access
Want to get cache hits
Solution: Use prefetching
The cache-aware implementation of the exact greedy algorithm runs about twice as fast as the naive version on large datasets
For the approximate algorithm, the problem is solved by choosing an appropriate block size
Blocks for Out-of-core Computation
The system can be made even more performant by better utilizing disk space to handle data that doesn’t fit in main memory
Divide the data into multiple blocks and store them on disk
Use an independent thread to pre-fetch blocks into a main-memory buffer, so computation and disk reading happen concurrently (a toy version of this pattern is sketched below)
Also uses block compression and block sharding
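The snippet below is a toy illustration of that producer/consumer prefetching pattern in Python, with random arrays standing in for disk blocks; it shows only the concurrency structure, not anything from XGBoost’s actual C++ implementation.

```python
# A toy illustration of the prefetching pattern: one producer thread "reads
# blocks from disk" into a bounded buffer while the main thread consumes them,
# so I/O and computation overlap.
import threading
import queue
import numpy as np

NUM_BLOCKS = 8
buffer = queue.Queue(maxsize=2)           # the in-memory prefetch buffer

def read_blocks():
    for i in range(NUM_BLOCKS):
        block = np.random.default_rng(i).normal(size=(1000, 10))  # stand-in for a disk read
        buffer.put(block)
    buffer.put(None)                      # sentinel: no more blocks

threading.Thread(target=read_blocks, daemon=True).start()

total = 0.0
while (block := buffer.get()) is not None:
    total += block.sum()                  # stand-in for split-finding work on the block
print(total)
```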
Conclusion
Two Strong Points of the Article
True end-to-end system optimization. Very impressive focus on creating a scalable, performant system at every level: algorithm, software, hardware, and distributed systems.
Impressive flexibility, enabled by accepting arbitrary loss functions: can do regression, classification, ranking, etc. with XGBoost.
Two Weak Points of the Article
Evaluation was very focused on efficiency. Would have liked to see more about predictive performance on various tasks.
The focus on so many different optimizations makes the paper very broad, with little room to treat any single one in depth
Take-Home Message
Why XGBoost?
Superior predictions to many other algorithms, especially on tabular data
Incredibly efficient end-to-end system
And hopefully every word of “XGBoost: A Scalable Tree Boosting System” now makes sense
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5): 1189–1232. https://doi.org/10.1214/aos/1013203451.