Hyperparameter Tuning

Mario Sanchez

Note: this course is part of a series; you can start it from Intro to Deep Learning with TensorFlow.

Introduction to hyperparameter tuning

We bet you were excited to build your first neural network model, train it, and evaluate it. But usually the story doesn’t end there. Previously, we chose certain values for the following:

  • the learning rate
  • the batch size
  • the number of epochs
  • the number of units per hidden layer
  • the activation functions

These are all hyperparameters. Why did we choose those numbers and values? Sometimes you start with a “hunch,” or by looking at the values other machine learning engineers often use in their solutions.

Almost any machine learning project you start will have the pipeline depicted in the diagram below.

[Diagram: the machine learning pipeline: split the data, set hyperparameters, fit on the training set, evaluate on the validation set, adjust the hyperparameters if needed, then evaluate the final model on the test set.]

First, we split data into the following sets:

  • A training set to fit the model by updating the parameters (weights) of a neural network model. Previously, we randomly selected 67% (two thirds) of our data to be the training set.
  • A validation set for checking the model’s fit. It is a biased evaluation since the model is modified based on its performance on this set. We haven’t set aside a validation set until now.
  • A test set is used for an unbiased evaluation of the model. The test set should never be used for hyperparameter selection and tweaking! Rather, it is used to compare different models. For example, in Kaggle competitions, the leaderboard is determined based on the performance on a test set. Previously, we allocated 33% of the total data to be the test set.

Second, we set hyperparameters to some initial values. We fit the model to the training data and check how it performs on the validation data by looking at accuracy for classification or mean absolute error for regression. Finally, if the performance is not satisfactory, we change the hyperparameters and repeat the steps. Otherwise, we evaluate the final model on the test set and report the results.

What percentage of our data should we set aside for our training, validation, and test sets? Well, it depends on the amount of data you have and the task you are performing. Usually, machine learning engineers choose 70% of data for training, 20% for validating, and 10% for testing. But other splits are possible.
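As an illustration, here is one way to produce such a 70/20/10 split with scikit-learn (the variable names features and labels and the random_state value are placeholders for this sketch, not part of the lesson):

from sklearn.model_selection import train_test_split

# Hold out 30% of the data first, then split that 30% into a validation set (20% of the total)
# and a test set (10% of the total).
x_train, x_rest, y_train, y_rest = train_test_split(features, labels, test_size=0.30, random_state=42)
x_valid, x_test, y_valid, y_test = train_test_split(x_rest, y_rest, test_size=1/3, random_state=42)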

Using a validation set for hyperparameter tuning

Using the training data to choose hyperparameters might lead to overfitting to the training data, meaning the model learns patterns specific to the training data that do not apply to new data. For that reason, hyperparameters are chosen on a held-out set called the validation set. In TensorFlow with Keras, a validation split can be specified as a parameter in the .fit() function:

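A minimal sketch (the training data, number of epochs, and batch size here are placeholders; validation_split is the relevant argument):

model.fit(x_train, y_train, epochs=40, batch_size=16, validation_split=0.2, verbose=0)

With validation_split=0.2, Keras sets aside the last 20% of the training data and reports the loss and metrics on it after every epoch.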

Manual tuning: learning rate

Neural networks are trained with the gradient descent algorithm, and one of the most important hyperparameters in network training is the learning rate. The learning rate determines how big of a change you apply to the network weights as a consequence of the error gradient calculated on a batch of training data.

A larger learning rate leads to faster learning, at the risk of getting stuck in a suboptimal solution (a local minimum). A smaller learning rate might produce a good suboptimal or even global solution, but it will take much longer to converge. At the extremes, a learning rate that is too large leads to an unstable learning process that oscillates over the epochs, while a learning rate that is too small may never converge or may get stuck in a local minimum.

It can be helpful to test different learning rates as we change our hyperparameters. A learning rate of 1.0 leads to oscillations, 0.01 may give us a good performance, while 1e-07 is too small and almost nothing is learned within the allotted time.

[Plot: learning curves for learning rates 1.0, 0.01, and 1e-07, showing oscillation, good convergence, and almost no learning, respectively.]
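In Keras, the learning rate is set on the optimizer passed to .compile(). A sketch, assuming a regression model and the Adam optimizer (the optimizer choice, loss, and metric are assumptions, not taken from the lesson):

from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=0.01)  # the hyperparameter under study
model.compile(loss='mse', metrics=['mae'], optimizer=opt)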

Manual tuning: batch size

The batch size is a hyperparameter that determines how many training samples are seen before updating the network’s parameters (weight and bias matrices).

When the batch contains all the training examples, the process is called batch gradient descent. If the batch has a single sample, it is called stochastic gradient descent. Finally, when 1 < batch size < number of training points, it is called mini-batch gradient descent. An advantage of using batches is that GPUs can parallelize the neural network computations within a batch.

How do we choose the batch size for our model? On one hand, a larger batch size provides better gradient estimates and a solution closer to the optimum, but this comes at a cost to computational efficiency and generalization performance. On the other hand, a smaller batch size gives a poorer estimate of the gradient, but the learning is performed faster. Finding the “sweet spot” depends on the dataset and the problem, and can be determined through hyperparameter tuning.

For this experiment, we fix the learning rate to 0.01 and try the following batch sizes: 1, 2, 10, and 16. Notice how small batch sizes have a larger variance (more oscillation in the learning curve).

[Plot: learning curves for batch sizes 1, 2, 10, and 16 with the learning rate fixed at 0.01; the smaller batch sizes show more oscillation.]
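A sketch of that experiment (build_model is a hypothetical helper standing in for whatever function builds and compiles the model with the fixed 0.01 learning rate; the number of epochs is a placeholder):

results = {}
for batch_size in [1, 2, 10, 16]:
    model = build_model(learning_rate=0.01)  # hypothetical helper, not part of the lesson
    history = model.fit(x_train, y_train, epochs=30, batch_size=batch_size, validation_split=0.2, verbose=0)
    results[batch_size] = history.history['val_loss'][-1]  # final validation loss for this batch size
print(results)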

Want to improve the performance with a larger batch size? A good trick is to increase the learning rate!

Manual tuning: epochs and early stopping

Gradient descent is an iterative process: it passes through the training data many times. The number of epochs is a hyperparameter representing the number of complete passes through the training dataset. This is typically a large number (100, 1000, or larger). If the data is split into batches, the optimizer sees all of the batches in one epoch.

How do you choose the number of epochs? Too many epochs can lead to overfitting, and too few to underfitting. One trick is to use early stopping: when the monitored performance (typically on the validation set) reaches a plateau or starts degrading, training stops.

To illustrate, we can introduce some overfitting by increasing the number of parameters in the neural network model, which gives the following plot:

[Plot: training and validation MAE over epochs; the validation curve decreases at first and then starts increasing.]

We know we are overfitting because the validation error at first decreases but eventually starts increasing. The final validation MAE is ~3034, while the training MAE is ~1000. That’s a big difference. We see that the training could have been stopped earlier (around epoch 50).

We can specify early stopping in TensorFlow with Keras by creating an EarlyStopping callback and adding it as a parameter when we fit our model.

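A sketch of that setup (the callback arguments match the description below; the rest of the .fit() call is a placeholder):

from tensorflow.keras.callbacks import EarlyStopping

stop = EarlyStopping(monitor='val_loss', mode='min', patience=40)
history = model.fit(x_train, y_train, epochs=500, batch_size=16, validation_split=0.2, callbacks=[stop], verbose=0)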

Here, we include the following:

  • monitor = val_loss, which means we are monitoring the validation loss to decide when to stop the training
  • mode = min, which means we seek minimal loss
  • patience = 40, which means that if the monitored metric stops improving, training continues for 40 more epochs before stopping, in case the plateau is only temporary and performance improves again

Manual tuning: changing the model

We saw in the previous exercise that if you have a big model and you train for too long, you might overfit. Let us now see the opposite: a model that is too simple.

In the code below, we compare a model with no hidden layers (only an output layer) and a model with a single hidden layer. The models look like this:

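A sketch of the two architectures (the number of input features is taken from the training data; the 64 hidden units, the ReLU activation, and the compile settings are assumptions for illustration):

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

num_features = x_train.shape[1]  # width of the feature matrix

# Model 1: no hidden layers, a single output neuron
model_one = Sequential([Input(shape=(num_features,)), Dense(1)])
model_one.compile(loss='mse', metrics=['mae'], optimizer='adam')

# Model 2: one hidden layer, then the output neuron
model_two = Sequential([Input(shape=(num_features,)), Dense(64, activation='relu'), Dense(1)])
model_two.compile(loss='mse', metrics=['mae'], optimizer='adam')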

When we run the training, we get the learning curves shown in the plots below.

If you observe Plot #1 for the model with no hidden layers, you will see an issue: the validation curve is below the training curve. This means that the training performance could still improve, but the model is not complex enough to allow it. This phenomenon is called underfitting. You can also notice that no early stopping occurs here, since the model’s performance remains poor the whole time.

[Plot #1: learning curves for the model without hidden layers.]

Plot #2 is for the model with a single hidden layer. You can observe a well-behaved curve, early stopping is triggered at epoch 38, and we get a much better result. Nice!

[Plot #2: learning curves for the model with one hidden layer, with early stopping at epoch 38.]

How do we choose the number of hidden layers and the number of units per layer? That is a tough question and there is no good answer. The rule of thumb is to start with one hidden layer and add as many units as we have features in the dataset. However, this might not always work. We need to try things out and observe our learning curve.

Towards automated tuning: grid and random search

So far we’ve been manually setting and adjusting hyperparameters to train and evaluate our model. If we didn’t like the result, we changed the hyperparameters to some other values. However, this is rather cumbersome; it would be nice if we could make these changes in a systematic and automated way. Fortunately, there are some strategies for automated hyperparameter tuning, including the following two.

Grid search, or exhaustive search, tries every combination of desired hyperparameter values. If, for example, we want to try learning rates of 0.01 and 0.001 and batch sizes of 10, 30, and 50, grid search will try six combinations of parameters (0.01 and 10, 0.01 and 30, 0.01 and 50, 0.001 and 10, and so on). This obviously gets very computationally demanding when we increase the number of values per hyperparameter or the number of hyperparameters we want to tune.

On the other hand, random search samples random combinations of hyperparameter values instead of trying them all.

Grid search in Keras

To use GridSearchCV from scikit-learn for regression, we first need to wrap our neural network model in a KerasRegressor:

model = KerasRegressor(build_fn=design_model)

Then we need to set up the desired hyperparameter grid (we don’t use many values, for the sake of speed):

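A sketch of such a grid; the keys match the parameter names used later in the lesson (batch_size and nb_epoch), while the specific values here are only illustrative:

param_grid = {'batch_size': [4, 16, 64], 'nb_epoch': [10, 50, 100]}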

Finally, we initialize a GridSearchCV object and fit our model to the data:

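A sketch of that step (the scoring setup follows the description below; the absence of other arguments, such as cv, is an assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=make_scorer(mean_squared_error, greater_is_better=False), return_train_score=True)
grid_result = grid.fit(x_train, y_train, verbose=0)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))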

Notice that we initialized the scoring parameter with scikit-learn’s make_scorer() function. We evaluate our hyperparameter combinations with the mean squared error, making sure that greater_is_better is set to False, since we are searching for the set of hyperparameters that yields the smallest error.

Randomized search in Keras

We first change our hyperparameter grid specification for the randomized search in order to have more options:

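A sketch of that specification (using scipy's randint to sample integers is an assumption; its upper bound is exclusive, hence 17 and 101 to cover [2, 16] and [10, 100]):

from scipy.stats import randint

param_distributions = {'batch_size': randint(2, 17), 'nb_epoch': randint(10, 101)}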

Randomized search will sample values for batch_size and nb_epoch from uniform distributions over the intervals [2, 16] and [10, 100], respectively, for a fixed number of iterations. In our case, 12 iterations:

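A sketch of that search (the scorer mirrors the grid-search sketch above; other settings are assumptions):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, mean_squared_error

search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=12, scoring=make_scorer(mean_squared_error, greater_is_better=False))
search_result = search.fit(x_train, y_train, verbose=0)
print("Best: %f using %s" % (search_result.best_score_, search_result.best_params_))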

We cover only simpler cases here, but you can set up GridSearchCV and RandomizedSearchCV to tune over any hyperparameters you can think of: optimizers, number of hidden layers, number of neurons per layer, and so on.

The next course is about Regularization: dropout.
