Introduction to Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of neural networks for modeling sequential data such as stock prices, audio clips, DNA sequences, sequences of video frames, sentences, and so on. The main architectural differences between feedforward neural networks (FNNs) and RNNs are: 1) RNNs work on inputs and outputs of arbitrary lengths, while FNNs have fixed-size inputs and outputs, and 2) a neuron of an RNN receives its own output from the previous time step in addition to its inputs (from the training instances or the previous layer) and sends its output back to itself. Let’s take a closer look at how RNNs work, from the neuron level up to the network level:

This figure shows what a single recurrent neuron looks like. It receives two inputs: $latex x^{<t>}$, the input vector of an instance at time step t, and $latex a_i^{<t-1>}$, the output of the i-th activation function (i.e., the i-th neuron) of the layer from the previous time step t-1. The following equation shows how the output $latex a_i^{<t>}$ is computed for a single instance at the current time step t:

$latex a_i^{<t>} = g(W_{a}^{(i)} \cdot a_i^{<t-1>} + W_{x}^{(i)} \cdot x^{<t>} + b_a^{(i)})$

where i indicates the i-th neuron of the layer. $latex W_{a}$ (# of neurons $latex \times$ # of neurons) and $latex W_{x}$ (# of neurons $latex \times$ # of input features) are the weight matrices for the previous output $latex a_i^{<t-1>}$ and the input $latex x^{<t>}$, respectively. $latex b_a$ is the vector of bias terms for the layer, and $latex g( \cdot )$ is the activation function (e.g., tanh, ReLU, etc.). These weights and biases are the parameters that are optimized during RNN training. Note that in a full layer each neuron receives the outputs of all neurons from the previous time step, not just its own, which is why $latex W_a$ has shape # of neurons $latex \times$ # of neurons; the layer-level form is shown next.

This figure may make it clearer how a recurrent neuron works: it shows the recurrent neuron unrolled through time. $latex a^{<0>}$ is generally a vector of zeros.

Now, let’s take a look at the process at the layer level. $latex \hat{y}$ indicates an estimated value for a target value. The following equations show how the activation $latex a^{<t>}$ and the estimate $latex \hat{y}^{<t>}$ are computed:

$latex a^{<t>} = g_1(W_a \cdot a^{<t-1>} + W_x \cdot x^{<t>} + b_a)$

$latex \hat{y}^{<t>} = g_2(W_y \cdot a^{<t>} + b_y)$

where $latex g_1( \cdot )$ is the activation function that produces $latex a^{<t>}$; it could be tanh, ReLU, etc., and tanh is the most common choice in RNNs. $latex g_2( \cdot )$ is the activation function for the output $latex \hat{y}^{<t>}$; for a binary classification problem, sigmoid could be used here.
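
To make the shapes concrete, here is a minimal NumPy sketch of these two equations for a single time step. The sizes (3 input features, 5 neurons, 1 output) and the random values are made up for illustration, and the weight matrices are stored transposed (features $latex \times$ neurons) so they work with row-vector mini-batches, matching the TensorFlow code below:

import numpy as np

n_features, n_neurons, n_outputs = 3, 5, 1
batch_size = 4

x_t = np.random.rand(batch_size, n_features)   # x^<t> for 4 instances
a_prev = np.zeros((batch_size, n_neurons))     # a^<t-1>, zeros at t = 1

Wx = np.random.randn(n_features, n_neurons)    # input weights
Wa = np.random.randn(n_neurons, n_neurons)     # recurrent weights
ba = np.zeros(n_neurons)
Wy = np.random.randn(n_neurons, n_outputs)     # output weights
by = np.zeros(n_outputs)

a_t = np.tanh(a_prev @ Wa + x_t @ Wx + ba)     # a^<t> = g1(...), g1 = tanh
y_hat = 1 / (1 + np.exp(-(a_t @ Wy + by)))     # y_hat^<t> = g2(...), g2 = sigmoid
print(a_t.shape, y_hat.shape)                  # (4, 5) (4, 1)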

Simple Code in TensorFlow

The following is simple TensorFlow (1.x-style) code showing how the above concepts work for a single layer unrolled over two time steps:

import tensorflow as tf 

n_features = 1000  # number of input features per time step
n_neurons = 100    # number of recurrent neurons in the layer

# one placeholder per time step: x^<1> and x^<2>
x1 = tf.placeholder(tf.float32, [None, n_features])
x2 = tf.placeholder(tf.float32, [None, n_features])

# Wx: input weights, Wa: recurrent weights, ba: bias terms
Wx = tf.Variable(tf.random_normal(shape=[n_features, n_neurons], dtype=tf.float32))
Wa = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
ba = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

# a^<1> = tanh(x^<1> Wx + ba), since a^<0> is a vector of zeros
a1 = tf.tanh(tf.matmul(x1, Wx) + ba)
# a^<2> = tanh(a^<1> Wa + x^<2> Wx + ba)
a2 = tf.tanh(tf.matmul(a1, Wa) + tf.matmul(x2, Wx) + ba)

init = tf.global_variables_initializer()
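
To actually evaluate the two activations, the graph can be run in a session. The mini-batches below are just random data for illustration:

import numpy as np

x1_batch = np.random.rand(4, n_features)  # 4 hypothetical instances at time step 1
x2_batch = np.random.rand(4, n_features)  # 4 hypothetical instances at time step 2

with tf.Session() as sess:
    sess.run(init)
    a1_val, a2_val = sess.run([a1, a2], feed_dict={x1: x1_batch, x2: x2_batch})
    print(a1_val.shape, a2_val.shape)  # both (4, n_neurons)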

Better Code with the dynamic_rnn() function

The dynamic_rnn() function runs the cell over the given number of time steps (n_steps), so you don’t have to create one placeholder and one set of operations per time step as above.

n_steps = 2  # number of time steps in each input sequence

# X holds a whole sequence: [batch_size, n_steps, n_features]
X = tf.placeholder(tf.float32, [None, n_steps, n_features])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
# outputs holds the activations at every time step; states holds the final state
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
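
As a quick check (again with made-up random data), the outputs and the final state can be evaluated the same way; for a basic cell, the final state is simply the activation at the last time step:

init = tf.global_variables_initializer()
X_batch = np.random.rand(4, n_steps, n_features)  # hypothetical batch of 4 sequences

with tf.Session() as sess:
    sess.run(init)
    outputs_val, states_val = sess.run([outputs, states], feed_dict={X: X_batch})
    print(outputs_val.shape)  # (4, n_steps, n_neurons)
    print(states_val.shape)   # (4, n_neurons)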

Mapping Input and Output Sequences

As mentioned above, RNNs work on inputs and outputs of arbitrary lengths. Let’s take a look at the following cases:

[Many-to-Many]

This RNN takes a sequence of inputs and produces a sequence of outputs. For example, this type of RNN can be used for predicting stock prices: the input would be the prices over the last N days, and the output would be the prices over the same N days shifted forward by one day, as sketched below.
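
A minimal sketch of this setup in the same TensorFlow style: an OutputProjectionWrapper projects each time step’s activation down to a single predicted price. The sequence length, the placeholder shapes, and the MSE loss are assumptions for illustration, not a full training script:

n_steps = 20    # prices over the last 20 days
n_features = 1  # one price per day

X = tf.placeholder(tf.float32, [None, n_steps, n_features])
y = tf.placeholder(tf.float32, [None, n_steps, 1])  # the same prices shifted by one day

cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=100), output_size=1)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)  # one prediction per step

loss = tf.reduce_mean(tf.square(outputs - y))  # MSE between predicted and target prices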

[Many-to-One]

This takes a sequence of inputs and ignores all outputs except for the last one. Sentiment analysis is a good example of this case: the RNN takes a sequence of words from a product or movie review and predicts the sentiment (e.g., like or hate).
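
A sketch of the many-to-one case: the cell runs over the whole word sequence, but only the activation at the last time step is used for the classification. The review length, the embedding size, and the dense-plus-sigmoid output layer are assumptions for illustration:

n_steps = 50      # words per review (padded or truncated)
n_features = 300  # e.g., word-embedding size

X = tf.placeholder(tf.float32, [None, n_steps, n_features])
y = tf.placeholder(tf.float32, [None, 1])  # 1 = like, 0 = hate

cell = tf.contrib.rnn.BasicRNNCell(num_units=100)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

last_output = outputs[:, -1, :]           # keep only the last time step
logits = tf.layers.dense(last_output, 1)  # one score per review
y_proba = tf.sigmoid(logits)              # probability of "like"
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))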

[One-to-Many]

This takes a single input and produces a sequence of outputs. Music generation could be an example of this case: the input could be a genre (e.g., R&B, Jazz, etc.), and the output would be a sequence of notes.
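
One simple way to sketch the one-to-many case (an assumption, not the only approach) is to repeat the single input, e.g., a one-hot genre vector, at every time step and produce one note distribution per step; practical music-generation models usually feed each generated note back in as the next input instead:

n_genres = 5   # one-hot genre vector as the single input
n_steps = 30   # length of the generated note sequence
n_notes = 128  # e.g., a MIDI-like note vocabulary

genre = tf.placeholder(tf.float32, [None, n_genres])
genre_seq = tf.tile(tf.expand_dims(genre, 1), [1, n_steps, 1])  # repeat the input at every step

cell = tf.contrib.rnn.BasicRNNCell(num_units=100)
outputs, states = tf.nn.dynamic_rnn(cell, genre_seq, dtype=tf.float32)
note_logits = tf.layers.dense(outputs, n_notes)  # one note distribution per time step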

[Many-to-Many, Encoder-Decoder]

This network takes an input sequence of arbitrary length in an encoder and produces an output sequence of arbitrary length in a decoder; it is also called an Encoder-Decoder. The encoder takes the input sequence and outputs a single vector representation of it, and then the decoder produces the output sequence from that vector representation. For example, this network can be used for language translation: the input would be a sentence in one language, and the output would be the translated sentence in another language.
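
A bare-bones sketch of the encoder-decoder idea: the encoder’s final state initializes the decoder. The fixed lengths, the already-embedded inputs, and the single-layer basic cells are simplifying assumptions; a practical translation model would add word embeddings, variable-length handling, and usually attention:

n_enc_steps, n_dec_steps, embed_size, n_neurons = 30, 30, 256, 512

encoder_inputs = tf.placeholder(tf.float32, [None, n_enc_steps, embed_size])
decoder_inputs = tf.placeholder(tf.float32, [None, n_dec_steps, embed_size])

with tf.variable_scope("encoder"):
    enc_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    _, enc_state = tf.nn.dynamic_rnn(enc_cell, encoder_inputs, dtype=tf.float32)

with tf.variable_scope("decoder"):
    dec_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    dec_outputs, _ = tf.nn.dynamic_rnn(dec_cell, decoder_inputs, initial_state=enc_state)

# enc_state is the single vector representation of the input sentence;
# dec_outputs can be projected to target-vocabulary logits with a dense layer.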