How to minimize Cost function in Neural Networks?



The goal of a Neural Network is to minimize the cost function. The lower the cost function, the closer the predicted value (ŷ) is to the actual value (y).

Now, the cost function can be minimized by adjusting the weights. So, here we are going to learn how the weights are adjusted.
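
As a quick aside, the cost function in this post can be thought of as something like the mean squared error between ŷ and y. Below is a minimal sketch in Python (the helper name and the numbers are made up for illustration, not taken from any particular library):

```python
import numpy as np

def cost(y_hat, y):
    # Mean squared error: average of the squared differences between
    # the predicted values (y_hat) and the actual values (y).
    return np.mean((y_hat - y) ** 2)

# Tiny usage example with made-up numbers:
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(cost(y_hat, y))  # a small value -> predictions are close to the targets
```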

Brute force approach: here we take a lot of different weights and select the one which gives the lowest cost function.
Let's look at the graph below:
This is a very simple approach. We set a bunch of weights and calculate the cost function for each one of them. The one with the lowest cost function is the best weight for the given neural network.
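
To make this concrete, here is a minimal sketch of the brute force search for a model with a single weight (the data, the cost function, and the grid of 25 candidate weights are all assumptions made just for this illustration):

```python
import numpy as np

# Made-up training data for a one-weight model: y is roughly w * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def cost(w):
    # Mean squared error for a given candidate weight w.
    y_hat = w * x
    return np.mean((y_hat - y) ** 2)

# Brute force: try a grid of 25 candidate weights and keep the best one.
candidate_weights = np.linspace(-5, 5, 25)
costs = [cost(w) for w in candidate_weights]
best_w = candidate_weights[np.argmin(costs)]
print(best_w)  # the weight from the grid with the lowest cost function
```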

Problems:
Curse of Dimensionality
Suppose we have a very high-dimensional dataset, and all of these dimensions (features) are fed into 'n' neurons. That means the number of synapses (weights) becomes much, much higher than in the simple example from the Artificial Neural Network post, where only the features deemed important by the neurons really matter. Trying every combination of weights over all of those synapses would take far too long to process.
So, the brute force method is not a good idea.
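
For a rough sense of scale, here is a back-of-the-envelope calculation (the counts below, 25 weights and 1,000 candidate values per weight, are assumptions chosen only to illustrate the explosion):

```python
# Assume a small network with 25 weights and 1,000 candidate values per weight.
values_per_weight = 1000
num_weights = 25

# Brute force has to evaluate the cost for every combination of weights:
combinations = values_per_weight ** num_weights
print(combinations)  # 10**75 cost-function evaluations -- completely infeasible
```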

Solution to the Curse of Dimensionality: "Gradient Descent"

Gradient Descent is a faster way to find the weights that minimize the cost function.
Let's see what's happening in the pictures above:
In Figure 1, let's say we start from that position (the red ball). From there we look at the angle of our cost function at that point and find the slope. If the slope is negative, we move the red ball downhill along it. The red ball then reaches position 2.
In Figure 2, we again look at the angle and move the ball downhill along the negative slope, which takes us to position 3.
We continue with the same process until we reach the best position, i.e. the lowest cost function.
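
In code, that loop might look something like the minimal sketch below, reusing the same made-up one-weight example (the learning rate, starting weight, and number of steps are arbitrary choices):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

w = -4.0             # starting position of the "red ball"
learning_rate = 0.01

for step in range(100):
    y_hat = w * x
    # Slope (derivative) of the mean squared error with respect to w.
    slope = np.mean(2 * (y_hat - y) * x)
    # Move downhill: step in the direction opposite to the slope.
    w = w - learning_rate * slope

print(w)  # ends up close to the weight that minimizes the cost
```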

How is this faster than the brute force method?
In the brute force method, we try out each and every weight and calculate the cost function every time. Finally, the weight with the lowest cost function is selected as the best. Since it tries out so many weights, the processing time is very high.
In gradient descent, we are only interested in the slope at the current point. If the slope is negative we go downhill, ignoring the positive-slope side.
In the image above, the brute force method used 25 weights, i.e. processing was done for 25 steps. But with gradient descent we needed only 4 steps to reach the lowest cost function.
Problems of Gradient Descent:
What if our cost function looks like this?
If we try to apply normal gradient descent here, it will find a local minimum of the cost function instead of the global one. Gradient descent has found the wrong minimum, so we do not end up with the correct weights or an optimized neural network.
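
One way to see this is to run the same update rule on a made-up non-convex cost curve; the function, starting point, and learning rate below are invented purely to show the ball getting stuck:

```python
def cost(w):
    # A made-up non-convex cost curve with two valleys:
    # a shallow (local) minimum near w = +0.93 and a deeper
    # (global) minimum near w = -1.06.
    return w**4 - 2 * w**2 + 0.5 * w

def slope(w):
    # Derivative of the cost with respect to w.
    return 4 * w**3 - 4 * w + 0.5

w = 2.0                      # start on the right-hand side of the curve
for step in range(1000):
    w = w - 0.01 * slope(w)  # plain gradient descent update

print(w)  # settles near +0.93 (the local minimum), missing the global one near -1.06
```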

Solution: "Stochastic Gradient Descent"
In gradient descent, we take the whole dataset together, fit it to the neurons, and calculate the cost function as a whole; this is also known as batch gradient descent.
But in stochastic gradient descent, we take one row at a time. The weights are adjusted after every row before we move on to the next row.
SGD is much more efficient and faster than batch gradient descent, as it does not have to wait for the whole dataset to be loaded and run for every single weight update. It just picks one row at a time and calculates the cost function for it.
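
Here is a minimal sketch of that difference, again on the made-up one-weight example, with the batch update and the stochastic row-by-row update side by side (learning rate and epoch count are arbitrary; in practice the rows are usually shuffled each epoch as well):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
learning_rate = 0.01

# Batch gradient descent: one weight update per pass over the WHOLE dataset.
w_batch = 0.0
for epoch in range(100):
    slope = np.mean(2 * (w_batch * x - y) * x)  # slope over all rows at once
    w_batch -= learning_rate * slope

# Stochastic gradient descent: the weight is adjusted after EVERY row.
w_sgd = 0.0
for epoch in range(100):
    for xi, yi in zip(x, y):                    # one row at a time
        slope = 2 * (w_sgd * xi - yi) * xi      # slope for this single row
        w_sgd -= learning_rate * slope

print(w_batch, w_sgd)  # both end up near the best weight (~2.0)
```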
