Part 2: Understanding Forward Propagation and Backward Propagation in Neural Networks

Raghav Agarwal
6 min read · Jun 24, 2021


This blog covers the aspects of training a neural network. The following contents are covered:

#Index
1. Training a Single Neural Network
2. Training a Multi-Layered Neural Network
3. Visualization of training on the data set created in Part-1

Training a Single Neural Network

A single neural network is defined by one activation function applied to a d-dimensional input vector, which means there are d weights to learn; during training, the loss is minimized (or maximized, depending on how the objective is framed).
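In equation form (a minimal sketch with assumed notation, where f is the activation function, x the input and w the weight vector):

\hat{y} \;=\; f\!\left(w^{T}x\right) \;=\; f\!\left(\textstyle\sum_{i=1}^{d} w_i x_i\right)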

1. Defining Loss Function

For simplicity, let's take the loss function to be mean squared error plus a regularization term.
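Written out, assuming squared-error loss with an L2 penalty (a sketch; λ is the regularization strength):

L(w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(w^{T}x_i)\right)^{2} \;+\; \lambda\,\lVert w\rVert^{2}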

2. Defining Optimization Problem
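In standard form (assumed here), we look for the weight vector that minimizes this loss:

w^{*} \;=\; \underset{w}{\arg\min}\; L(w)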

3. Solving Optimization Problem

To solve this problem, we use the same SGD approach to minimize the loss and update the weights accordingly (a small code sketch follows the algorithm below).

#Basic Algorithm:
1. Initialize the weights W
2. For iter = 1 to k:
       Compute Del_W (the gradient of the loss w.r.t. W)
       W_new = W_old - η * Del_W        (η is the learning rate)
       if W_new is (approximately) equal to W_old:
           stop iterating
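A minimal NumPy sketch of this loop for the single-neuron case (names are hypothetical; squared-error loss, an identity activation and full-batch updates are assumed for simplicity):

import numpy as np

def sgd_train(X, y, lr=0.01, max_iters=1000, tol=1e-6):
    # X: (n, d) input matrix, y: (n,) targets; learns a single neuron y_hat = X @ w
    n, d = X.shape
    w = np.zeros(d)                              # 1. initialize the weights
    for _ in range(max_iters):                   # 2. iterate up to k times
        y_hat = X @ w                            # forward pass (identity activation)
        del_w = -(2.0 / n) * X.T @ (y - y_hat)   # gradient of the mean squared error
        w_new = w - lr * del_w                   # W_new = W_old - eta * Del_W
        if np.linalg.norm(w_new - w) < tol:      # stop when the weights barely change
            return w_new
        w = w_new
    return w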

Computation of Del_W_L

Our weight vector and its del-weight (gradient) vector look like this:
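As a sketch (component-wise form assumed):

W = \left[w_1, w_2, \ldots, w_d\right]^{T}, \qquad \nabla_{W} L = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_d}\right]^{T}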

To compute Del_W_L, we need to compute each partial derivative dL/dw_i; stacking these together gives us Del_W_L.

For simplicity,

- ignore the regularization parameter
- take f(x) = x (the identity function); the resulting gradient is sketched right after this list
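Under these simplifications, each partial derivative works out to the following (a sketch, where x_{ij} is the j-th feature of x_i):

\frac{\partial L}{\partial w_j} \;=\; \frac{\partial}{\partial w_j}\,\frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^{T}x_i\right)^{2} \;=\; -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - w^{T}x_i\right)x_{ij}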

This is how a single neural network is trained; what follows is the multi-layered neural network, the next step from the concept we just discussed.

Understanding the Notation of a Multi-Layered NN

Descriptions

f_ij : the activation function attached to the corresponding unit
O_ij : the output of the corresponding f_ij
W_Kij : the weight on the connection going from a unit in the previous layer to a unit in the next layer
W_Ijk : the weight matrix collecting all such weights for one layer

Note: It's a good exercise to fill in all the weights, functions and output values to get a grasp of the notation, because these notations should not be a hurdle when it comes to understanding the training part.

Training a Multi-Layered Neural Network

Looking at the above ML-NN, we have three weight matrices (W_Ijk). Let's take each weight matrix and see how the weights in it are updated.

#Just a Note
#Flow of understanding:
1. Look at the weight matrix.
2. Look at the part of the network corresponding to that weight matrix.
3. For each matrix, the chain-rule method for computing del_W_Kij for each W_Kij is the same, so understanding one makes understanding the others simple.
4. Look at the whole training algorithm and then connect the dots across the whole methodology.

1. Third Layer

Weight matrix for third layer
Chain rule for third layer

Looking at the diagram above, the del for the circled weight is as follows:
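A sketch of what that chain rule looks like (indices assumed), for a weight W^3_ij connecting the output O_2i to unit f_3j in the third layer:

\frac{\partial L}{\partial W^{3}_{ij}} \;=\; \frac{\partial L}{\partial O_{3j}} \cdot \frac{\partial O_{3j}}{\partial W^{3}_{ij}} \;=\; \frac{\partial L}{\partial O_{3j}} \cdot f'_{3j}(\cdot)\cdot O_{2i}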

This is very similar to the one-layer NN, as the chain-rule approach is the same.

2. Second Layer

Weight matrix for second layer
Chain rule for second layer

Looking at the diagram above, the del for the circled weight is as follows:
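A sketch of the corresponding chain rule (indices assumed): because O_2j feeds every unit of the third layer, the contributions along all those paths are summed,

\frac{\partial L}{\partial W^{2}_{ij}} \;=\; \left(\sum_{k}\frac{\partial L}{\partial O_{3k}} \cdot \frac{\partial O_{3k}}{\partial O_{2j}}\right) \cdot \frac{\partial O_{2j}}{\partial W^{2}_{ij}}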

3. First Layer

Weight matrix for first layer
Chain Rule for First Layer

Looking at the diagram above, the del for the circled weight is as follows:
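A sketch for this case (indices assumed): the chain now runs through both downstream layers, and the bracketed terms are exactly the dels already computed for the later layers, which is where memoization pays off,

\frac{\partial L}{\partial W^{1}_{ij}} \;=\; \left(\sum_{k}\sum_{m}\frac{\partial L}{\partial O_{3m}} \cdot \frac{\partial O_{3m}}{\partial O_{2k}} \cdot \frac{\partial O_{2k}}{\partial O_{1j}}\right) \cdot \frac{\partial O_{1j}}{\partial W^{1}_{ij}}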

Following this approach, we get the del for each corresponding weight matrix. Let's see the whole algorithm now, followed by a small code sketch.

#Algorithm for training a NN
D = {x_i, y_i}
1. Initialize weights
2. For each x_i in D:
   a) Pass x_i forward through the network.
      ... (Forward Propagation)
   b) Compute Loss(y_i', y_i)
   c) Compute all the derivatives, with memoization.
      ... (Memoization: dels that are already computed are stored and not computed again, similar to dynamic programming)
   d) Update weights from the end of the network back to the start.
3. Repeat Step 2 until convergence: W_Kij_old ~~ W_Kij_new

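A compact NumPy sketch of this algorithm for a tiny two-layer network (all names, sizes, activations and the squared-error loss here are illustrative assumptions, not the exact model used in this blog):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(D, d_in, d_hidden, lr=0.1, epochs=50):
    # 1. Initialize weights (two weight matrices for a two-layer toy network)
    W1 = np.random.randn(d_in, d_hidden) * 0.1
    W2 = np.random.randn(d_hidden, 1) * 0.1
    for _ in range(epochs):                      # 3. repeat (fixed epochs stand in for convergence)
        for x, y in D:                           # 2. for each (x_i, y_i) in D
            # a) forward propagation
            o1 = sigmoid(x @ W1)                 # hidden-layer outputs
            y_hat = float(o1 @ W2)               # network output (linear output unit)
            # b) derivative of the squared-error loss (y_hat - y)^2
            dL_dyhat = 2.0 * (y_hat - y)
            # c) derivatives via the chain rule; o1 and dL_dyhat are reused (memoized), not recomputed
            dL_dW2 = (o1 * dL_dyhat).reshape(-1, 1)
            dL_do1 = W2.flatten() * dL_dyhat
            dL_dW1 = np.outer(x, dL_do1 * o1 * (1.0 - o1))
            # d) update weights from the end of the network back to the start
            W2 -= lr * dL_dW2
            W1 -= lr * dL_dW1
    return W1, W2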
Visualization of How Gradient Descent Works in NN (ref)

Optional: Visualization of training on the data set created in Part-1.

After understanding the training part, it is very nice to see an actual model getting trained. As this blog is a follow-up, to make things interesting we'll watch the training in a visual manner.

1. Defining model
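A minimal Keras sketch that would produce the summary below (the 2-D input shape is inferred from the first layer's parameter count, 720 * (2 + 1) = 2160; the activations, optimizer and loss are assumptions for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(720, activation="relu", input_shape=(2,)),  # 2-D input: the t-SNE points from Part-1
    tf.keras.layers.Dense(420, activation="relu"),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(80, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),                   # single unit drawing the decision boundary
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])  # assumed settings
model.summary()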

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_49 (Dense)             (None, 720)               2160
_________________________________________________________________
dense_50 (Dense)             (None, 420)               302820
_________________________________________________________________
dense_51 (Dense)             (None, 300)               126300
_________________________________________________________________
dense_52 (Dense)             (None, 80)                24080
_________________________________________________________________
dense_53 (Dense)             (None, 20)                1620
_________________________________________________________________
dense_54 (Dense)             (None, 1)                 21
=================================================================
Total params: 457,001
Trainable params: 457,001
Non-trainable params: 0
_________________________________________________________________

This is the “summary” which tf gives us about the model. To give an intuitive look at the network, our model looks like this:

Intuitive Diagram of NN

2. Training and Visualization of the model

Following Part-1, these points are the data obtained after applying t-SNE to the TF-IDF vectors, and we can see how the model trains itself and fits a boundary to this data.

To understand more about this, follow this Colab Link

Summary

Understanding this concept is crucial for understanding any deep-learning model; I hope this blog gave you a mathematical foundation for it.

My LinkedIn

Link to PART-1

Understanding the Base

This brings both Part-1 and Part-2 to completion. This cycle gives a foundation to anyone who wants to understand text classification, and provides the base for “Word-Embeddings”, “Recurrent-NN”, “Transformers”, “BERT” and many other things.

Any suggestions are welcome; do leave them in the comment section.

Thank you!


Raghav Agarwal

Data Scientist with experience at Microsoft, Cummins, and George Washington University.