Part 2: Understanding Forward Propagation and Backward Propagation in Neural Networks

Raghav Agarwal
6 min read · Jun 24, 2021


This blog covers the aspects of training a neural network. The following contents are covered:

#Index
1. Training a Single Neural Network
2. Training a Multi-Layered Neural Network
3. Visualization of training on the data set created in Part-1

Training a Single Neural Network

A single neural network is defined by one activation function applied to a d-dimensional input vector, which means there are d weights to learn; during training, the loss is minimized (or maximized, depending on how the objective is framed).
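In equation form (a minimal sketch with assumed notation, where f is the activation function, x the input and w the weight vector):

\hat{y} \;=\; f\!\left(w^{T}x\right) \;=\; f\!\left(\textstyle\sum_{i=1}^{d} w_i x_i\right)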

1. Defining Loss Function

For simplicity, let's take the loss function to be mean squared error plus a regularization term.
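Written out, assuming squared-error loss with an L2 penalty (a sketch; λ is the regularization strength):

L(w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(w^{T}x_i)\right)^{2} \;+\; \lambda\,\lVert w\rVert^{2}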

2. Defining Optimization Problem
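In standard form (assumed here), we look for the weight vector that minimizes this loss:

w^{*} \;=\; \underset{w}{\arg\min}\; L(w)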

3. Solving Optimization Problem

To solve this problem, we use the same SGD approach to minimize the loss and update the weights accordingly (a small code sketch follows the algorithm below).

#Basic Algorithm:
1. Initialize the weights W
2. For iter = 1 to k:
       Compute Del_W (the gradient of the loss w.r.t. W)
       W_new = W_old - η * Del_W        (η is the learning rate)
       if W_new is (approximately) equal to W_old:
           stop iterating
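A minimal NumPy sketch of this loop for the single-neuron case (names are hypothetical; squared-error loss, an identity activation and full-batch updates are assumed for simplicity):

import numpy as np

def sgd_train(X, y, lr=0.01, max_iters=1000, tol=1e-6):
    # X: (n, d) input matrix, y: (n,) targets; learns a single neuron y_hat = X @ w
    n, d = X.shape
    w = np.zeros(d)                              # 1. initialize the weights
    for _ in range(max_iters):                   # 2. iterate up to k times
        y_hat = X @ w                            # forward pass (identity activation)
        del_w = -(2.0 / n) * X.T @ (y - y_hat)   # gradient of the mean squared error
        w_new = w - lr * del_w                   # W_new = W_old - eta * Del_W
        if np.linalg.norm(w_new - w) < tol:      # stop when the weights barely change
            return w_new
        w = w_new
    return w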

Computation of Del_W_L

Our weight vector and its del-weight (gradient) vector look like this:
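As a sketch (component-wise form assumed):

W = \left[w_1, w_2, \ldots, w_d\right]^{T}, \qquad \nabla_{W} L = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_d}\right]^{T}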

To compute Del_W_L, we need to compute each partial derivative dL/dw_i; stacking these together gives us Del_W_L.

For simplicity,

- ignore the regularization parameter
- take f(x) = x (the identity function); the resulting gradient is sketched right after this list
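Under these simplifications, each partial derivative works out to the following (a sketch, where x_{ij} is the j-th feature of x_i):

\frac{\partial L}{\partial w_j} \;=\; \frac{\partial}{\partial w_j}\,\frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{T}x_i\right)^{2} \;=\; -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - w^{T}x_i\right)x_{ij}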

This is how a single neural network is trained; what follows is the multi-layered neural network, the next step from the concept we just discussed.

Understanding the Notation of a Multi-Layered NN

Descriptions

f_ij : the activation function attached to the corresponding unit
O_ij : the output of the corresponding f_ij
W_Kij : the weight on the connection going from a unit in the previous layer to a unit in the next layer
W_Ijk : the weight matrix collecting all such weights for one layer

Note: It's a good exercise to fill in all the weights, functions and output values to get a grasp of the notation, because these notations should not be a hurdle when it comes to understanding the training part.

Training a Multi-Layered Neural Network

Looking at the above ML-NN, we have three weight matrices (W_Ijk). Let's take each weight matrix and see how the weights in it are updated.

#Just a Note
#Flow of understanding:
1. Look at the weight matrix.
2. Look at the part of the network corresponding to that weight matrix.
3. For each matrix, the chain-rule method for computing del_W_Kij for each W_Kij is the same, so understanding one makes understanding the others simple.
4. Look at the whole training algorithm and then connect the dots across the whole methodology.

1. Third Layer

Weight matrix for third layer
Chain rule for third layer

Looking at the diagram above, the del for the circled weight is as follows:
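A sketch of what that chain rule looks like (indices assumed), for a weight W^3_ij connecting the output O_2i to unit f_3j in the third layer:

\frac{\partial L}{\partial W^{3}_{ij}} \;=\; \frac{\partial L}{\partial O_{3j}} \cdot \frac{\partial O_{3j}}{\partial W^{3}_{ij}} \;=\; \frac{\partial L}{\partial O_{3j}} \cdot f'_{3j}(\cdot)\cdot O_{2i}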

This is very similar to the one-layer NN, as the chain-rule approach is the same.

2. Second Layer

Weight matrix for second layer
Chain rule for second layer

Looking at the diagram above, the del for the circled weight is as follows:
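A sketch of the corresponding chain rule (indices assumed): because O_2j feeds every unit of the third layer, the contributions along all those paths are summed,

\frac{\partial L}{\partial W^{2}_{ij}} \;=\; \left(\sum_{k}\frac{\partial L}{\partial O_{3k}} \cdot \frac{\partial O_{3k}}{\partial O_{2j}}\right) \cdot \frac{\partial O_{2j}}{\partial W^{2}_{ij}}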

3. First Layer

Weight matrix for first layer
Chain Rule for First Layer

Looking at the diagram above, the del for the circled weight is as follows:
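A sketch for this case (indices assumed): the chain now runs through both downstream layers, and the bracketed terms are exactly the dels already computed for the later layers, which is where memoization pays off,

\frac{\partial L}{\partial W^{1}_{ij}} \;=\; \left(\sum_{k}\sum_{m}\frac{\partial L}{\partial O_{3m}} \cdot \frac{\partial O_{3m}}{\partial O_{2k}} \cdot \frac{\partial O_{2k}}{\partial O_{1j}}\right) \cdot \frac{\partial O_{1j}}{\partial W^{1}_{ij}}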

Following this approach, we get the del for each corresponding weight matrix. Let's see the whole algorithm now, followed by a small code sketch.

#Algorithm for training a NN
D = {x_i, y_i}
1. Initialize weights
2. For each x_i in D:
   a) Pass x_i forward through the network.
      ... (Forward Propagation)
   b) Compute Loss(y_i', y_i)
   c) Compute all the derivatives, with memoization.
      ... (Memoization: dels that are already computed are stored and not computed again, similar to dynamic programming)
   d) Update weights from the end of the network back to the start.
3. Repeat Step 2 until convergence: W_Kij_old ~~ W_Kij_new

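A compact NumPy sketch of this algorithm for a tiny two-layer network (all names, sizes, activations and the squared-error loss here are illustrative assumptions, not the exact model used in this blog):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(D, d_in, d_hidden, lr=0.1, epochs=50):
    # 1. Initialize weights (two weight matrices for a two-layer toy network)
    W1 = np.random.randn(d_in, d_hidden) * 0.1
    W2 = np.random.randn(d_hidden, 1) * 0.1
    for _ in range(epochs):                      # 3. repeat (fixed epochs stand in for convergence)
        for x, y in D:                           # 2. for each (x_i, y_i) in D
            # a) forward propagation
            o1 = sigmoid(x @ W1)                 # hidden-layer outputs
            y_hat = float(o1 @ W2)               # network output (linear output unit)
            # b) derivative of the squared-error loss (y_hat - y)^2
            dL_dyhat = 2.0 * (y_hat - y)
            # c) derivatives via the chain rule; o1 and dL_dyhat are reused (memoized), not recomputed
            dL_dW2 = (o1 * dL_dyhat).reshape(-1, 1)
            dL_do1 = W2.flatten() * dL_dyhat
            dL_dW1 = np.outer(x, dL_do1 * o1 * (1.0 - o1))
            # d) update weights from the end of the network back to the start
            W2 -= lr * dL_dW2
            W1 -= lr * dL_dW1
    return W1, W2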
Visualization of How Gradient Descent Works in NN (ref)

Optional: Visualization of training on the data set created in Part-1.

After understanding the training part, it is very nice to see an actual model getting trained. As this blog is a follow-up, to make things interesting we'll watch the training in a visual manner.

1. Defining model
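A minimal Keras sketch that would produce the summary below (the 2-D input shape is inferred from the first layer's parameter count, 720 * (2 + 1) = 2160; the activations, optimizer and loss are assumptions for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(720, activation="relu", input_shape=(2,)),  # 2-D input: the t-SNE points from Part-1
    tf.keras.layers.Dense(420, activation="relu"),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(80, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),                   # single unit drawing the decision boundary
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])  # assumed settings
model.summary()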

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_49 (Dense)             (None, 720)               2160
_________________________________________________________________
dense_50 (Dense)             (None, 420)               302820
_________________________________________________________________
dense_51 (Dense)             (None, 300)               126300
_________________________________________________________________
dense_52 (Dense)             (None, 80)                24080
_________________________________________________________________
dense_53 (Dense)             (None, 20)                1620
_________________________________________________________________
dense_54 (Dense)             (None, 1)                 21
=================================================================
Total params: 457,001
Trainable params: 457,001
Non-trainable params: 0
_________________________________________________________________

This is the “summary” which tf gives us about the model. To give an intuitive look at the network, our model looks like this:

Intuitive Diagram of NN

2. Training and Visualization of the model

Following Part-1, these points are the data obtained after applying t-SNE to the TF-IDF vectors, and we can see how the model trains itself and fits a boundary to this data.

To understand more about this, follow this Colab Link

Summary

Understanding this concept is crucial for understanding any deep-learning model; I hope this blog gave you a mathematical foundation for it.

My LinkedIn

Link to PART-1

Understanding the Base

This brings both Part-1 and Part-2 to completion. This cycle gives a foundation to anyone who wants to understand text classification, and provides the base for “Word-Embeddings”, “Recurrent-NN”, “Transformers”, “BERT” and many other things.

Any suggestions are welcome; do leave them in the comment section.

Thank you!


Raghav Agarwal

Data Scientist with experience at Microsoft, Cummins, and George Washington University.