# Recurrent Neural Networks

## Introduction

On previous forward neural networks, our output was a function between the current input and a set of weights. On recurrent neural networks(RNN), the previous network state is also influence the output, so recurrent neural networks also have a "notion of time". This effect by a loop on the layer output to it's input.

In other words, the RNN will be a function with inputs (input vector) and previous state . The new state will be .
The recurrent function, , will be fixed after training and used to every time step.

Recurrent Neural Networks are the best model for regression, because it take into account past values.

RNN are computation "Turing Machines" which means, with the correct set of weights it can compute anything, imagine this weights as a program.
Just to not let you too overconfident on RNN, there is no automatic back-propagation algorithms, that will find this "perfect set of weights".

#### Use cases of recurrent neural networks

• Machine translation (English --> French)
• Speech to text
• Market prediction
• Scene labelling (Combined with CNN)
• Car wheel steering. (Combined with CNN)

### Implementing Vanilla RNN on python

Bellow we have a simple implementation of RNN recurrent function: (Vanilla version)

The code that calculate up to the next state looks like this:

def rnn_step_forward(x, prev_h, Wx, Wh, b):
# We separate on steps to make the backpropagation easier
#forward pass in steps
# step 1
xWx = np.dot(x, Wx)

# step 2
phWh = np.dot(prev_h,Wh)

# step 3
# total
affine = xWx + phWh + b.T

# step 4
next_h = np.tanh(t)

# Cache iputs, state, and weights
# we are having prev_h.copy() since python params are pass by reference.
cache = (x, prev_h.copy(), Wx, Wh, next_h, affine)

return next_h, cache


Observe that in our case of RNN we are now more interested on the next state, not exactly the output,

Before we start let's just make explicit how to backpropagate the tanh block.

Now we can do the backpropagation step (For one single time-step)

def rnn_step_backward(dnext_h, cache):
(x, prev_h, Wx, Wh, next_h, affine) = cache

#backward in step
# step 4
# dt delta of total
# Gradient of tanh times dnext_h
dt = (1 - np.square(np.tanh(affine))) * (dnext_h)

# step 3
dxWx = dt
dphWh = dt
db = np.sum(dt, axis=0)

# step 2
# Gradient of the mul block
dWh = prev_h.T.dot(dphWh)
dprev_h = Wh.dot(dphWh.T).T

# step 1
# Gradient of the mul block
dx = dxWx.dot(Wx.T)
dWx = x.T.dot(dxWx)

return dx, dprev_h, dWx, dWh, db


A point to be noted is that the same function and the same set of parameters will be applied to every time-step.

A good initialization for the RNN states is zero. Again this is just the initial RNN state not it's weights.

These looping feature on RNNs can be confusing first but actually you can think as a normal neural network repeated(unrolled) multiple times. The number of times that you unroll can be consider how far in the past the network will remember. In other words each time is a time-step.

### Forward and backward propagation on each time-step

From the previous examples we presented code for forward and backpropagation for one time-step only. As presented before the RNN are unroled for each time-step (finite). Now we present how to do the forward propagation for each time-step.

def rnn_forward(x, h0, Wx, Wh, b):
"""
Run a vanilla RNN forward on an entire sequence of data. We assume an input
sequence composed of T vectors, each of dimension D. The RNN uses a hidden
size of H, and we work over a minibatch containing N sequences. After running
the RNN forward, we return the hidden states for all timesteps.

Inputs:
- x: Input data for the entire timeseries, of shape (N, T, D).
- h0: Initial hidden state, of shape (N, H)
- Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
- Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
- b: Biases of shape (H,)

Returns a tuple of:
- h: Hidden states for the entire timeseries, of shape (N, T, H).
- cache: Values needed in the backward pass
"""

# Get shapes
N, T, D = x.shape
# Initialization
h, cache = None, None
H = h0.shape[1]
h = np.zeros((N,T,H))

# keeping the inital value in the last element
# it will be overwritten
h[:,-1,:] = h0
cache = []

# For each time-step
for t in xrange(T):
h[:,t,:], cache_step = rnn_step_forward(x[:,t,:], h[:,t-1,:], Wx, Wh, b)
cache.append(cache_step)

# Return current state and cache
return h, cache

def rnn_backward(dh, cache):
"""
Compute the backward pass for a vanilla RNN over an entire sequence of data.

Inputs:
- dh: Upstream gradients of all hidden states, of shape (N, T, H)

Returns a tuple of:
- dx: Gradient of inputs, of shape (N, T, D)
- dh0: Gradient of initial hidden state, of shape (N, H)
- dWx: Gradient of input-to-hidden weights, of shape (D, H)
- dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
- db: Gradient of biases, of shape (H,)
"""
dx, dh0, dWx, dWh, db = None, None, None, None, None
# Get shapes
N,T,H = dh.shape
D = cache[0][0].shape[1] # D taken from x in cache

# Initialization keeping the gradients with the same shape it's respective inputs/weights
dx, dprev_h = np.zeros((N, T, D)),np.zeros((N, H))
dWx, dWh, db = np.zeros((D, H)), np.zeros((H, H)), np.zeros((H,))
dh = dh.copy()

# For each time-step
for t in reversed(xrange(T)):
dh[:,t,:]  += dprev_h # updating the previous layer dh
dx_, dprev_h, dWx_, dWh_, db_ = rnn_step_backward(dh[:,t,:], cache[t])
# Observe that we sum each time-step gradient
dx[:,t,:] += dx_
dWx += dWx_
dWh += dWh_
db += db_

dh0 = dprev_h

return dx, dh0, dWx, dWh, db


Bellow we show a diagram that present the multiple ways that you could use a recurrent neural network compared to the forward networks. Consider the inputs the red blocks, and the outputs the blue blocks.

• One to one: Normal Forward network, ie: Image on the input, label on the output
• One to many(RNN): (Image captioning) Image in, words describing the scene out (CNN regions detected + RNN)
• Many to one(RNN): (Sentiment Analysis) Words on a phrase on the input, sentiment on the output (Good/Bad) product.
• Many to many(RNN): (Translation), Words on English phrase on input, Portuguese on output.
• Many to many(RNN): (Video Classification) Video in, description of video on output.

### Stacking RNNs

Bellow we describe how we add "depth" to RNN and also how to unroll RNNs to deal with time.
Observe that the output of the RNNs are feed to deeper layers, while the state is feed for dealing with past states.

### Simple regression example

Here we present a simple case where we want the RNN to complete the word, we give to the network the characters h,e,l,l , our vocabulary here is [h,e,l,o]. Observe that after we input the first 'h' the network want's to output the wrong answer (right is on green), but near the end, after the second 'l' it want's to output the right answer 'o'. Here the order that the characters come in does matter.

### Describing images

If you connect a convolution neural network, with pre-trained RNN. The RNN will be able to describe what it "see" on the image.

Basically we get a pre-trained CNN (ie: VGG) and connect the second-to-last FC layer and connect to a RNN. After this you train the whole thing end-to-end.

### Long Short Term Memory networks(LSTM)

LSTM provides a different recurrent formula , it's more powefull than vanilla RNN, due to it's complex that add "residual information" to the next state instead of just transforming each state. Imagine LSTM are the "residual" version of RNNs.

In other words LSTM suffer much less from vanishing gradients than normal RNNs. Remember that the plus gates distribute the gradients.

So by suffering less from vanishing gradients, the LSTMs can remember much more in the past. So from now just use LSTMs when you think about RNN.

Also in other words LSTM are better to remember long term dependencies.

Observe from the animation bellow how hast the gradients on the RNN disappear compared to LSTM.

The vanishing problem can be solved with LSTM, but another problem that can happen with all recurrent neural network is the exploding gradient problem.

To fix the exploding gradient problem, people normally do a gradient clipping, that will allow only a maximum gradient value.

This highway for the gradients is called Cell-State, so one difference compared to the RNN that has only the state flowing, on LSTM we have states and the cell state.

### LSTM Gate

Doing a zoom on the LSTM gate. This also improves how to do the backpropagation.

Code for lstm forward propagation for one time-step

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
N,H = prev_c.shape

#forward pass in steps
# step 1: calculate activation vector
a = np.dot(x, Wx) + np.dot(prev_h,Wh) + b.T

# step 2: input gate
a_i = sigmoid(a[:,0:H])

# step 3: forget gate
a_f = sigmoid(a[:,H:2*H])

# step 4: output gate
a_o = sigmoid(a[:,2*H:3*H])

# step 5: block input gate
a_g= np.tanh(a[:,3*H:4*H])

# step 6: next cell state
next_c = a_f * prev_c +  a_i * a_g

# step 7: next hidden state
next_h = a_o * np.tanh(next_c)

# we are having *.copy() since python params are pass by reference.
cache = (x, prev_h.copy(), prev_c.copy(), a, a_i, a_f, a_o, a_g, next_h, next_c, Wx, Wh)

return next_h, next_c, cache


Now the backward propagation for one time-step

def lstm_step_backward(dnext_h, dnext_c, cache):
(x, prev_h, prev_c, a, a_i, a_f, a_o, a_g, next_h, next_c, Wx, Wh) = cache
N,H = dnext_h.shape
da = np.zeros(a.shape)

# step 7:
dnext_c = dnext_c.copy()
dnext_c += dnext_h * a_o * (1 - np.tanh(next_c) ** 2)
da_o = np.tanh(next_c) * dnext_h

# step 6:
da_f    = dnext_c * prev_c
dprev_c = dnext_c * a_f
da_i    = dnext_c * a_g
da_g    = dnext_c * a_i

# step 5:
da[:,3*H:4*H] = (1 - np.square(a_g)) * da_g

# step 4:
da[:,2*H:3*H] = (1 - a_o) * a_o * da_o

# step 3:
da[:,H:2*H] = (1 - a_f) * a_f * da_f

# step 2:
da[:,0:H] = (1 - a_i) * a_i * da_i

# step 1:
db = np.sum(da, axis=0)
dx = da.dot(Wx.T)
dWx = x.T.dot(da)
dprev_h = da.dot(Wh.T)
dWh = prev_h.T.dot(da)

return dx, dprev_h, dprev_c, dWx, dWh, db


### GRU (Gated Recurrent Unit) Cells

The Gru cells can be considered as a variant of the LSTM (Also want's to fight vanishing gradients) cell, but more computational efficient. On this cell the forget and input gates are merged (update gate).