Long Short-Term Memory

Mario Sanchez · 10 min read

Introduction

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist. (Figure 1)

[Figure 1: A recurrent neural network with a loop, which allows information to be passed from one step of the network to the next.]

What is LSTM?

LSTM (Long Short-Term Memory) is a recurrent neural network (RNN) architecture widely used in Deep Learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction tasks.

Unlike traditional neural networks, LSTM incorporates feedback connections, allowing it to process entire sequences of data, not just individual data points. This makes it highly effective in understanding and predicting patterns in sequential data like time series, text, and speech.
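To make this concrete, here is a minimal sketch of feeding a whole sequence through an LSTM layer using PyTorch's nn.LSTM. The tensor shapes and hyperparameters below are illustrative assumptions, not values taken from this article:

```python
import torch
import torch.nn as nn

# A toy batch of 4 sequences, each 12 time steps long, with 8 features per step.
x = torch.randn(4, 12, 8)

# An LSTM layer with a 16-dimensional hidden state and cell state.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# The layer consumes the whole sequence and returns one hidden state per time step,
# plus the final hidden state h_n and final cell state c_n.
outputs, (h_n, c_n) = lstm(x)

print(outputs.shape)  # torch.Size([4, 12, 16]) - an output for every time step
print(h_n.shape)      # torch.Size([1, 4, 16])  - final hidden state
print(c_n.shape)      # torch.Size([1, 4, 16])  - final cell state
```

Because the layer sees the whole sequence, each output at time step t can depend on everything the network has read up to that point.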

Overview of Recurrent Neural Networks

In the Figure 1 diagram, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think about it a bit more, it turns out that they aren't all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

[Figure 2: An unrolled recurrent neural network: the loop expanded into a chain of copies of the same network, each passing a message to its successor.]
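To see the unrolling in code, here is a minimal NumPy sketch of a vanilla RNN processed step by step. The update h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b) is the standard formulation; the specific dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 8, 16, 12

# One shared set of weights, reused ("unrolled") at every time step.
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

xs = rng.standard_normal((seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                        # initial hidden state

hidden_states = []
for x_t in xs:
    # The same cell A is applied at each step, receiving the previous state as input.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

print(len(hidden_states), hidden_states[-1].shape)  # 12 (16,)
```

Each copy of the cell passes its hidden state forward, which is exactly the "message to a successor" in the unrolled picture.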

How do LSTMs Work?

Here's a breakdown of how LSTMs work:

1. Cell State:

Imagine an LSTM unit as a tiny factory with a conveyor belt running through it. This conveyor belt is called the cell state, and it's where the network stores information for long periods. Unlike traditional RNNs, the cell state is protected by gates, which means information can only be added or removed carefully.

2. Gates:

There are three gates in an LSTM unit:

  • Forget Gate: Decides which information to throw away from the cell state. It works like a filter, letting through only the relevant stuff.
  • Input Gate: Decides which new information to add to the cell state. It considers both the current input and the previous hidden state (the output of the LSTM unit at the previous time step).
  • Output Gate: Decides what information from the cell state to output. This output is then used by the rest of the network.

3. Processing Steps:

At each time step, the LSTM unit performs the following steps (sketched in code after this list):

  • Read the input: The network receives new data (like a word in a sentence or a point in a time series).
  • Update the cell state: The forget gate, input gate, and cell state all work together to update the information stored on the conveyor belt.
  • Compute the output: The output gate decides what information from the cell state to share with the rest of the network.
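Putting these steps together, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations (sigmoid gates, tanh candidate). The weight matrices and dimensions are illustrative assumptions rather than anything prescribed by this article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: update the cell state and produce the new hidden state."""
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]: previous output plus new input

    f_t = sigmoid(W_f @ z + b_f)         # forget gate: what to erase from c_prev
    i_t = sigmoid(W_i @ z + b_i)         # input gate: what new information to write
    c_hat = np.tanh(W_c @ z + b_c)       # candidate values for the cell state
    c_t = f_t * c_prev + i_t * c_hat     # update the "conveyor belt"
    o_t = sigmoid(W_o @ z + b_o)         # output gate: what to expose
    h_t = o_t * np.tanh(c_t)             # new hidden state / output

    return h_t, c_t
```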

4. Sequence Processing:

LSTMs are particularly powerful because they can process entire sequences of data, not just individual data points. This means they can learn long-term dependencies and context, which is crucial for tasks like speech recognition, language translation, and time series forecasting.
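Continuing the sketch above, processing a whole sequence is just repeated application of that single step, carrying the hidden and cell states forward (again with assumed, illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 8, 16, 12

def init(rows, cols):
    return rng.standard_normal((rows, cols)) * 0.1

params = (
    init(hidden_size, hidden_size + input_size),  # W_f
    init(hidden_size, hidden_size + input_size),  # W_i
    init(hidden_size, hidden_size + input_size),  # W_c
    init(hidden_size, hidden_size + input_size),  # W_o
    np.zeros(hidden_size),                        # b_f
    np.zeros(hidden_size),                        # b_i
    np.zeros(hidden_size),                        # b_c
    np.zeros(hidden_size),                        # b_o
)

xs = rng.standard_normal((seq_len, input_size))   # a toy input sequence
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)

for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)  # lstm_step as defined in the sketch above

print(h.shape, c.shape)  # (16,) (16,)
```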

Here are some additional points to keep in mind:

  • LSTM layers can be stacked, with each layer receiving the output sequence of the previous layer as its input. This allows them to learn more complex relationships in the data (see the sketch after this list).
  • LSTMs can be trained on a variety of data types, including text, audio, and video.
  • LSTMs are still an active area of research, and new variations and improvements are constantly being developed.
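In frameworks like PyTorch, stacking layers is a single argument. A minimal sketch, with assumed, illustrative sizes:

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers: layer 2 receives layer 1's output sequence as its input.
stacked = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(4, 12, 8)              # (batch, time, features), toy data
outputs, (h_n, c_n) = stacked(x)

print(outputs.shape)  # torch.Size([4, 12, 16]) - top layer's output per time step
print(h_n.shape)      # torch.Size([2, 4, 16])  - final hidden state of each layer
```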

Example of How LSTMs Work

Let’s take an example to understand how LSTM works. Here we have two sentences separated by a full stop. The first sentence is “Bob is a nice person,” and the second is “Dan, on the other hand, is evil.” Clearly, in the first sentence we are talking about Bob, and as soon as we encounter the full stop (.), we start talking about Dan.

As we move from the first sentence to the second, our network should realize that we are no longer talking about Bob; our subject is now Dan. The forget gate is what allows the network to discard the information about Bob. Let’s look at the roles these gates play in the LSTM architecture.

Forget Gate

In a cell of the LSTM network, the first step is to decide whether we should keep the information from the previous time step or forget it. Here is the equation for the forget gate:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

The sigmoid activation function squashes the values between 0 and 1, acting as a gate to control the flow of information. A value close to 0 means that the corresponding information is more likely to be forgotten, while a value close to 1 means that the information is more likely to be retained in the cell state.
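As a quick numeric illustration of this gating behaviour (the numbers below are made up for the example), multiplying the previous cell state elementwise by the forget gate output keeps entries where f_t is near 1 and wipes out entries where f_t is near 0:

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])   # previous cell state (toy values)
f_t = np.array([0.97, 0.05, 0.60])    # forget gate output after the sigmoid

print(f_t * c_prev)  # [ 1.94  -0.075  0.48 ] - first entry kept, second mostly erased
```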

Input Gate

Let’s take another example.

“Bob knows swimming. He told me over the phone that he had served the navy for four long years.”

So, in both these sentences, we are talking about Bob. However, each gives a different kind of information about him. The first sentence tells us that he knows how to swim, while the second tells us that he used the phone and that he served in the navy for four years.

Now think about it: given the context of the first sentence, which piece of information in the second sentence is critical? That he told us over the phone, or that he served in the navy? In this context, it doesn’t matter whether he used the phone or any other medium of communication to pass on the information. The fact that he was in the navy is the important information, and this is something we want our model to remember for future computation. This is the task of the input gate.

The input gate quantifies the importance of the new information carried by the input. Here is the equation of the input gate:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

Again, we apply the sigmoid function, so the value of i_t at time step t will be between 0 and 1.

Updating the cell state

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

C_t = f_t * C_{t-1} + i_t * C̃_t

This equation reflects the LSTM's ability to selectively remember or forget information from the previous cell state and integrate new information from the current input. The forget gate f_t and input gate i_t play crucial roles in this process, allowing LSTMs to manage and update their memory over time.
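A small worked example of this update (with made-up numbers) shows how the two gates blend old and new content:

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])   # previous cell state C_{t-1}
f_t = np.array([0.9, 0.1, 0.5])       # forget gate output
i_t = np.array([0.2, 0.8, 0.0])       # input gate output
c_hat = np.array([1.0, 1.0, -1.0])    # candidate values proposed at this step

c_t = f_t * c_prev + i_t * c_hat      # keep some old content, write some new content
print(c_t)                            # -> [2.0, 0.65, 0.4] (up to formatting)
```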

Output Gate

Now consider this sentence.

“Bob single-handedly fought the enemy and died for his country. For his contributions, brave______.”

During this task, we have to complete the second sentence. The minute we see the word “brave”, we know that we are talking about a person. In the sentence, only Bob is brave; we cannot say the enemy is brave, or the country is brave. So, based on the current context, we have to supply a relevant word to fill in the blank. That word is our output, and this is the function of our output gate.

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)

The output gate activation (o_t) determines how much of the information from the current cell state (C_t) should be used to produce the output (h_t) for the current time step. The output gate allows the LSTM to selectively pass information from the memory cell to the output, regulating the flow of information and enabling the network to capture relevant patterns in the sequential data.
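Continuing the made-up numbers from the cell state example above, the hidden state exposed at this step is just the gated, squashed cell state:

```python
import numpy as np

c_t = np.array([2.0, 0.65, 0.4])      # cell state computed in the earlier example
o_t = np.array([1.0, 0.5, 0.0])       # output gate activation (toy values)

h_t = o_t * np.tanh(c_t)              # expose only part of the cell state as output
print(h_t)                            # -> approximately [0.964, 0.286, 0.0]
```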

LSTM Applications

LSTM networks find useful applications in the following areas:

  • Language modeling
  • Machine translation
  • Handwriting recognition
  • Image captioning
  • Image generation using attention models
  • Question answering
  • Video-to-text conversion
  • Polyphonic music modeling
  • Speech synthesis
  • Protein secondary structure prediction

What are Bidirectional LSTMs?

These are like an upgrade over LSTMs. In bidirectional LSTMs, each training sequence is presented forwards and backwards to two separate recurrent networks, both of which are connected to the same output layer. Bidirectional LSTMs therefore have complete information about every point in a given sequence: everything before and after it.

But how can a network rely on information that hasn’t happened yet? The human brain uses its senses to pick up information from words, sounds, or whole sentences that might at first make no sense but mean something in a later context. Conventional recurrent neural networks can only use the previous context to obtain information, whereas bidirectional LSTMs process the data in both directions within two hidden layers that feed the same output layer. This lets bidirectional LSTMs access long-range context in both directions.
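In PyTorch, a bidirectional LSTM is a single flag. A minimal sketch, with assumed, illustrative sizes:

```python
import torch
import torch.nn as nn

# bidirectional=True runs one LSTM forward over the sequence and one backward,
# then concatenates the two hidden states at every time step.
bi_lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(4, 12, 8)              # (batch, time, features), toy data
outputs, (h_n, c_n) = bi_lstm(x)

print(outputs.shape)  # torch.Size([4, 12, 32]) - forward and backward states concatenated
print(h_n.shape)      # torch.Size([2, 4, 16])  - final state of each direction
```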

