Layers in a Neural Network

Joshua Wood



In the previous course, we discussed how to implement a neural network; in this course, we will look at the layers that make it up.

Neural network model: layers

Layers are the building blocks of neural networks and can contain one or more neurons. Each layer is associated with parameters, weights and biases, that are tuned during learning. A fully connected layer, in which all neurons connect to all neurons in the next layer, is created the following way in TensorFlow:

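A minimal sketch of that call, using tf.keras.layers.Dense (the variable name layer is an assumption; the text only tells us the layer has three neurons):

```python
import tensorflow as tf

# A fully connected (dense) layer with three neurons
layer = tf.keras.layers.Dense(3)
```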

This layer looks like this graphically:

[Diagram: a fully connected layer with three neurons, together with its weight and bias parameters]

Pay attention to the dimensions of the weight and bias parameters. Since we chose to create a layer with three neurons, the number of outputs of this layer is 3. Hence, the bias parameter is a vector with three entries, of shape (3,). But what is the first dimension of the weights matrix? Without knowing how many features or input nodes are in the previous layer, we have no way of knowing! For that reason, with the following code:

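For instance, inspecting the layer's weights attribute (a sketch, continuing with the layer created above):

```python
# The layer has not seen any input yet, so no weights have been created
print(layer.weights)  # prints: []
```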

we get an empty array, since no input has been specified yet. However, if we write:

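something along these lines, where we pass a dummy batch with 11 features through the layer (the tf.ones input is illustrative):

```python
# A batch of one sample with 11 features; calling the layer builds its weights
x = tf.ones((1, 11))
output = layer(x)
print(layer.weights)
```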

we get that the weight matrix has shape (11, 3) and the bias vector has shape (3,). Compare these with the diagram above to make sure you can associate the resulting shapes with it.

Fortunately, we don’t have to worry about this. TensorFlow will determine the shapes of the weight matrix and bias matrix automatically the moment it encounters the first input.

Neural network model: input layer

Inputs to a neural network are usually not considered actual transformative layers; they are merely placeholders for data. In Keras, an input for a neural network can be specified with a tf.keras.layers.InputLayer object.

The following code initializes an input layer for a DataFrame my_data that has 15 columns:

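A sketch of that initialization (the variable name input_layer is ours):

```python
from tensorflow.keras.layers import InputLayer

# Placeholder for data with 15 features per sample; the batch
# dimension (number of samples) is left unspecified
input_layer = InputLayer(input_shape=(15,))
```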

Notice that the first dimension of the input_shape parameter has to equal the number of features in the data. You don't need to specify the second dimension: the number of samples, or batch size.

The following code avoids hard-coding by using the .shape property of the my_data DataFrame:

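For example (num_features is an illustrative name):

```python
# my_data.shape is (number of samples, number of features)
num_features = my_data.shape[1]
input_layer = InputLayer(input_shape=(num_features,))
```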

The following code adds this input layer to a model instance my_model:

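A sketch, assuming my_model is a Keras Sequential model:

```python
from tensorflow.keras.models import Sequential

my_model = Sequential()    # a model built as a linear stack of layers
my_model.add(input_layer)  # register the input placeholder
```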

The following code prints a useful summary of a model instance my_model:

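```python
my_model.summary()  # prints layer output shapes and parameter counts
```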

As you can see, the summary shows a total of 0 parameters: the input layer has no trainable parameters and is just a placeholder for data.

Neural network model: output layer

The output layer shape depends on your task. In the case of regression, we need one output for each sample. For example, if your data has 100 samples, you would expect your output to be a vector with 100 entries - a numerical prediction for each sample.

In our case, we are doing regression and wish to predict one number for each data point: the medical cost billed by health insurance, given in the charges column of our data. Hence, our output layer has only one neuron.

The following command adds a layer with one neuron to a model instance my_model:

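A sketch, assuming a Dense layer is used for the output as well:

```python
from tensorflow.keras.layers import Dense

# A single neuron: one numerical prediction (the charge) per sample
my_model.add(Dense(1))
```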

Notice that you don't need to specify the input shape of this layer, since TensorFlow with Keras can automatically infer it from the previous layer.

Neural network model: hidden layers

So far we have added one input layer and one output layer to our model. If you think about it, our model currently represents a linear regression. To capture more complex, non-linear interactions between the inputs and outputs, we'll need to incorporate hidden layers.

The following command adds a hidden layer to a model instance my_model:

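A sketch of that call (in the finished model this layer sits between the input and output layers):

```python
# 64 neurons with ReLU activation, to capture non-linear interactions
my_model.add(Dense(64, activation='relu'))
```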

We chose 64 (2^6) neurons: powers of two are a common convention, in part because they can map efficiently onto binary hardware.

With the activation parameter, we specify which activation function to apply to the output of our hidden layer. There are a number of activation functions, such as softmax and sigmoid, but ReLU (Rectified Linear Unit, relu in Keras) is very effective in many applications, and we'll use it here.

Adding more layers to a neural network naturally increases the number of parameters to be tuned. Every layer has an associated weight matrix and bias vector.

In the diagram below, we show the size of the parameter arrays for each layer. In our case, the first layer's weight matrix (red) has shape (11, 64) because we feed 11 features into 64 hidden neurons. The output layer (purple) has a weight matrix of shape (64, 1) because it takes 64 inputs and has 1 neuron.

[Diagram: parameter shapes per layer, with hidden-layer weights (red) of shape (11, 64) and output-layer weights (purple) of shape (64, 1)]

Optimizers

As we mentioned, our goal is for the network to effectively adjust its weights and biases in order to reach the best performance. Keras offers a variety of optimizers, such as SGD (Stochastic Gradient Descent), Adam, and RMSprop.

We’ll start by introducing the Adam optimizer:

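A sketch (the variable name opt is ours):

```python
from tensorflow.keras.optimizers import Adam

# learning_rate controls the step size taken in parameter space
opt = Adam(learning_rate=0.01)
```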

The learning rate determines how big the steps are that the optimizer takes in the parameter space (weights and biases); it is itself a hyperparameter that can be tuned. While model parameters are the values the model uses to make predictions, hyperparameters determine the learning process itself (learning rate, number of iterations, optimizer type).

If the learning rate is set too high, the optimizer makes large jumps and may miss the solution. If it is set too low, the learning process is slow and might not converge to a desirable solution within the allotted time. Here we'll use 0.01, a commonly used value.

Once the optimizer algorithm is chosen, a model instance my_model is compiled with the following code:

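For example:

```python
# mse is the loss being minimized; mae is tracked as an extra metric
my_model.compile(loss='mse', metrics=['mae'], optimizer=opt)
```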

loss denotes the measure of learning success: the lower the loss, the better the performance. For regression, the most commonly used loss function is the Mean Squared Error, mse (the average squared difference between the estimated values and the actual values).

Additionally, we want to observe the progress of the Mean Absolute Error (mae) while training the model, because MAE can give us a better sense than mse of how far off we are from the true values in the units we are predicting. In our case, we are predicting charges in dollars, and MAE tells us how many dollars we're off, on average, from the actual values as the network is being trained.

Training and evaluating the model

Now that we have built the model, we are ready to train it using the training data.

The following command trains a model instance my_model using training data my_data and training labels my_labels:

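A sketch of that call; the text fixes epochs at 50, while the batch_size value here (16) is just an illustrative choice:

```python
my_model.fit(my_data, my_labels, epochs=50, batch_size=16, verbose=1)
```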

my_model.fit() takes the following parameters:

  • my_data is the training data set.
  • my_labels are the true labels for the training data points.
  • epochs is the number of cycles through the full training dataset. Since training a neural network is an iterative process, you need multiple passes through the data. Here we chose 50 epochs, but how do you pick the number of epochs? It is hard to give one answer, since it depends on your dataset; among others, this is a hyperparameter that can be tuned, which we'll cover later.
  • batch_size is the number of data points to work through before updating the model parameters. It is also a hyperparameter that can be tuned.
  • verbose=1 shows the progress bar of the training.

When the training is finalized, we use the trained model to predict values for samples that the training procedure hasn't seen: the test set.

The following command evaluates the model instance my_model using the test data my_data and test labels my_labels:

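A sketch (the unpacked variable names are ours):

```python
# Returns the final loss (mse) and the tracked metric (mae) on the test set
val_mse, val_mae = my_model.evaluate(my_data, my_labels, verbose=0)
```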

In our case, model.evaluate() returns the value of our chosen loss metric (mse) and of the additional metric (mae).

So what is the final result? We should get an MAE of ~$3884.21, meaning that, on average, our predictions are off by around 3800 dollars. Is that a good result or a bad one?

Often you need an expert or domain knowledge to decide. What is an acceptable error for the application? Is $3800 a big error when deciding on insurance charges? Can you do better, and how? As you can see, the process doesn't stop here.



About Joshua Wood

Joshua is a Microsoft Azure Certified Cloud Professional and a Google Certified Associate Cloud Engineer, working in data analytics at Acme and specializing in the use of cloud infrastructure for machine learning and deep learning operations at scale.
