The goal of predictive modeling is to develop a model that is accurate on unseen data.

This can be achieved using statistical techniques where the training dataset is carefully used to estimate the performance of the model on new and unseen data.

In this tutorial you will discover how you can evaluate the performance of your gradient boosting models with XGBoost in Python.

After completing this tutorial, you will know:

- How to evaluate the performance of your XGBoost models using train and test datasets.
- How to evaluate the performance of your XGBoost models using k-fold cross validation.

Let’s get started.

- **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.
- **Update Mar/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

## Evaluate XGBoost Models With Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.

We can take our original dataset and split it into two parts. Train the algorithm on the first part, then make predictions on the second part and evaluate the predictions against the expected results.

The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.

A downside of this technique is that it can have a high variance. This means that differences between the training and test datasets can result in meaningful differences in the estimate of model accuracy.
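To make the variance point concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical model that classifies each of 600 records correctly with probability 0.75 (both numbers are illustrative and are not results from this tutorial), then scores the same set of predictions on several random 33% test subsets. The names `is_correct` and `accuracy_on_random_split` are invented for this illustration.

```python
import random

# Assumption: a hypothetical model gets each record right with probability 0.75.
# is_correct[i] is True when the simulated prediction for record i is correct.
rng = random.Random(7)
n_records = 600
is_correct = [rng.random() < 0.75 for _ in range(n_records)]

def accuracy_on_random_split(flags, test_fraction, seed):
    """Accuracy measured on one randomly chosen test subset."""
    indices = list(range(len(flags)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(flags) * test_fraction)
    test_indices = indices[:n_test]
    return sum(flags[i] for i in test_indices) / n_test

# Same simulated predictions, five different 33% test subsets.
estimates = [accuracy_on_random_split(is_correct, 0.33, seed) for seed in range(5)]
for seed, estimate in enumerate(estimates):
    print("split seed %d: accuracy %.2f%%" % (seed, estimate * 100.0))
print("spread: %.2f percentage points" % ((max(estimates) - min(estimates)) * 100.0))
```

This sketch isolates only the variation caused by which records land in the test set. With a real model the training set changes too, so the spread between splits can be larger.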

We can split the dataset into a train and test set using the **train_test_split()** function from the scikit-learn library. For example, we can split the dataset into a 67% and 33% split for training and test sets as follows:

```python
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
```

The full code listing is provided below using the Pima Indians onset of diabetes dataset, assumed to be in the current working directory.

Download the dataset and place it in your current working directory.

An XGBoost model with default configuration is fit on the training dataset and evaluated on the test dataset.

```python
# train-test split evaluation of xgboost model
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example summarizes the performance of the model on the test set.

## Evaluate XGBoost Models With k-Fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.

It works by splitting the dataset into k parts (e.g. k=5 or k=10). Each part is called a fold. The algorithm is trained on k-1 folds and tested on the one fold that was held back. This is repeated so that each fold of the dataset is given a chance to be the held-back test set.
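The fold mechanics described above can be sketched in plain Python. This is an illustrative helper (the name `k_fold_indices` is hypothetical), not scikit-learn's implementation. It assumes contiguous, unshuffled folds and spreads any remainder over the first folds so that fold sizes differ by at most one.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Ten samples, five folds: every sample is held out exactly once.
folds = list(k_fold_indices(n_samples=10, k=5))
for number, (train, test) in enumerate(folds, start=1):
    print("fold %d: train=%s test=%s" % (number, train, test))
```

Each index appears in exactly one test fold and in the training set of every other fold, which is what lets every record contribute to the performance estimate.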

After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
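As a small illustration of that summary step, the sketch below computes the mean and standard deviation of a list of per-fold scores. The scores are made-up values chosen for the example, not results from this tutorial.

```python
from statistics import mean, pstdev

# Assumption: illustrative per-fold accuracies, not results from this tutorial.
fold_scores = [0.70, 0.80, 0.75, 0.85, 0.90]

mean_accuracy = mean(fold_scores)
# pstdev is the population standard deviation, matching NumPy's default std().
std_accuracy = pstdev(fold_scores)

print("Accuracy: %.2f%% (%.2f%%)" % (mean_accuracy * 100, std_accuracy * 100))
```

The mean is the headline estimate, and the standard deviation indicates how much that estimate moved from fold to fold.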

The result is a more reliable estimate of the performance of the algorithm on new data given your test data. It is more accurate because the algorithm is trained and evaluated multiple times on different data.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.

We can use the k-fold cross validation support provided in scikit-learn. First we create a **KFold** object specifying the number of folds (and, optionally, whether to shuffle the data before splitting). The size of the dataset is not needed at this point, because the folds are computed when the scheme is applied to a specific dataset. The **cross_val_score()** function from scikit-learn then evaluates a model using the cross validation scheme and returns an array of scores, one for each model trained on each fold.

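```python
# recent scikit-learn releases require shuffle=True when random_state is set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
```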

The full code listing for evaluating an XGBoost model with k-fold cross validation is provided below for completeness.

```python
# k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
import xgboost
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = xgboost.XGBClassifier()
# recent scikit-learn releases require shuffle=True when random_state is set
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
```

Running this example summarizes the performance of the default model configuration on the dataset including both the mean and standard deviation classification accuracy.

If you have many classes for a classification type predictive modeling problem or the classes are imbalanced (there are a lot more instances for one class than another), it can be a good idea to create stratified folds when performing cross validation.

This has the effect of enforcing the same distribution of classes in each fold as in the whole training dataset when performing the cross validation evaluation. The scikit-learn library provides this capability in the StratifiedKFold class.
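To show what "the same distribution of classes in each fold" means, here is a minimal pure-Python sketch. It is not the algorithm used by StratifiedKFold. It is a simplified round-robin assignment under the assumption of a toy, imbalanced label vector, and the helper name `stratified_fold_assignment` is hypothetical.

```python
from collections import Counter, defaultdict

def stratified_fold_assignment(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for class_indices in indices_by_class.values():
        # Dealing each class out in turn keeps its share equal across folds.
        for position, index in enumerate(class_indices):
            folds[position % k].append(index)
    return folds

# Assumption: a toy imbalanced label vector (12 negatives, 6 positives).
labels = [0] * 12 + [1] * 6
stratified_folds = stratified_fold_assignment(labels, k=3)
for number, fold in enumerate(stratified_folds, start=1):
    counts = dict(Counter(labels[i] for i in fold))
    print("fold %d class counts: %s" % (number, counts))
```

Every fold ends up with the same 2:1 ratio of negatives to positives as the full label vector, so no fold is left without examples of the minority class.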

Below is the same example modified to use stratified cross validation to evaluate an XGBoost model.

```python
# stratified k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
import xgboost
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = xgboost.XGBClassifier()
# recent scikit-learn releases require shuffle=True when random_state is set
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
```

Running this example prints the mean and standard deviation of the classification accuracy across the stratified folds.

## What Techniques to Use When

- Generally, k-fold cross validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.
- Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
- Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions.

If in doubt, use 10-fold cross validation for regression problems and stratified 10-fold cross validation on classification problems.
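That rule of thumb can be captured in a small helper. This is a sketch under stated assumptions: the function name `make_cv`, its `task` strings, and the default seed are hypothetical choices made for illustration. Only `KFold` and `StratifiedKFold` come from scikit-learn.

```python
from sklearn.model_selection import KFold, StratifiedKFold

def make_cv(task, n_splits=10, seed=7):
    """Return a cross-validation splitter for 'classification' or 'regression'."""
    if task == "classification":
        # Stratified folds preserve the class distribution in each fold.
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    if task == "regression":
        # Plain k-fold: there are no class labels to stratify on.
        return KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    raise ValueError("task must be 'classification' or 'regression'")

cv_for_classifier = make_cv("classification")
cv_for_regressor = make_cv("regression")
print(type(cv_for_classifier).__name__, type(cv_for_regressor).__name__)
```

The returned object can be passed as the `cv` argument of `cross_val_score()`, in the same way as the `kfold` variable in the listings above.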

## Summary

In this tutorial, you discovered how you can evaluate your XGBoost models by estimating how well they are likely to perform on unseen data.

Specifically, you learned:

- How to split your dataset into train and test subsets for training and evaluating the performance of your model.
- How you can create k XGBoost models on different subsets of the dataset and average the scores to get a more robust estimate of model performance.
- Heuristics to help choose between train-test split and k-fold cross validation for your problem.

Do you have any questions on how to evaluate the performance of XGBoost models or about this post? Ask your questions in the comments below and I will do my best to answer.