0:00 okay so this lecture is going to be
0:02 about back propagation in CNN
0:04 architecture before we get started with
0:07 this topic there are two things that I
0:09 want to point out here first is that
0:11 since we are learning the concept of
0:13 back propagation in CNN or since we are
0:16 going through the topic of CNN then I am
0:19 just assuming that you are already aware
0:21 of the concept of back propagation in
0:24 ANNs and machine learning so I am
0:26 assuming that you have a basic idea
0:28 around how the parameter training
0:30 happens in a machine learning algorithm
0:32 or in an architecture like artificial
0:35 neural network and you have a decent
0:37 understanding around gradient descent as
0:40 well so in case you are not comfortable
0:42 around these things then you can
0:44 consider checking out the back
0:45 propagation lectures from the deep
0:47 learning and machine learning playlist
0:48 and then you can come back to this
0:51 particular video so that it makes more
0:53 sense to you and the second thing that I
0:55 want to point out here is that this back
0:58 propagation thing in CNN is something
1:00 that you will not have to be bothered
1:03 about when you are working in a CNN
1:05 project, or I should say, when you are
1:07 using a convolutional neural network
1:09 architecture your code will handle
1:11 everything over here but it is just that
1:13 being a good data scientist you should
1:15 have some clarity about what is actually
1:17 happening behind the code execution and
1:20 this is why for your general
1:21 understanding and from the interview
1:23 perspective as well it is important that
1:26 you should have some idea around back
1:28 propagation in CNN architecture as well
1:31 this should not look like a black box to
1:33 you okay so keeping these two things in
1:35 the mind let's go ahead and proceed with
1:37 the topic
1:45 so if you have been following in
1:48 sequence then so far going by the
1:51 previous lectures of CNN you should be
1:53 already comfortable around convolutional
1:56 layer and you should have clarity about
1:58 what feature maps are; we also discussed
2:01 around pooling different types of
2:03 pooling like max pooling, min pooling,
2:05 average pooling Etc and exactly how
2:08 after applying the pooling method the
2:10 feature map gets reduced in size then
2:13 we apply another convolutional layer and
2:16 then we apply another pooling if
2:17 required you can add or remove the
2:19 number of layers for convolution
2:21 operation and pooling both, so there is
2:24 no concrete rule that you should
2:26 have at least a certain number of
2:27 convolutional or pooling layers, so it is
2:30 completely your choice we also
2:31 understood what the meaning of
2:34 the flattened layer is, where basically what
2:36 we do is we take each and every row of
2:38 pixels that we have within the image and
2:41 then we stack them on top of each other
2:45 in order to create the input column this
2:48 is the thing which happens in flattened
2:50 layer and then we pass this information
2:51 to the fully connected layer which is
2:54 the ANN attached to the end of this CNN
2:57 architecture and by the end you get the
3:00 classes with probability values in order
3:02 to do the classification of the object
3:05 within the image. So within the last three to
3:07 four lectures we have discussed
3:09 all these topics we have also discussed
3:12 about the concept of padding and strides
3:15 as well, so if your understanding
3:16 of all these things is absolutely
3:19 clear then we can go ahead and discuss
3:22 about back propagation now so first of
3:24 all I will quickly erase all this to
3:27 make it more clear and in order to
3:29 understand the back propagation in CNN I
3:32 am going to create a simpler version of
3:35 the CNN architecture so let's say that
3:38 we have an image here and the size of
3:41 the image is six by six and then we have
3:45 a filter or kernel matrix, so it is a
3:48 filter or you can call it a kernel, and this
3:51 kernel Matrix has a size of three by
3:54 three so basically we are going to use
3:56 this filter to convolve over the top of
3:59 the image this is something that you
4:01 should be anyways comfortable with going
4:03 by the previous lectures as we have
4:05 already discussed this and after doing
4:07 the convolutional operation the next
4:10 thing that you will get will be a
4:11 feature map, and since the image is in a
4:14 size of 6x6 and the filter we have in a
4:17 size of three by three going by the
4:20 formula that we have discussed earlier,
4:22 which was (M - N + 1) x (M - N + 1),
4:26 where M represents the size of the
4:29 M x M image and N represents the size
4:32 of the N x N kernel matrix, this
4:36 feature map will be of size 4 by 4,
4:38 and since over here we are considering
4:41 only one filter, you can also call it
4:44 a size of 4 by 4 by 1.
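To make the arithmetic concrete, here is a minimal NumPy sketch of a stride-1, no-padding convolution (strictly speaking a cross-correlation, which is what CNN libraries compute); the image and filter values are dummy placeholders, not the lecture's numbers:

```python
import numpy as np

def convolve2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image with stride 1 and no padding."""
    M, _ = image.shape
    N, _ = kernel.shape
    out = M - N + 1                                # output size from the formula above
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i:i + N, j:j + N]
            feature_map[i, j] = np.sum(patch * kernel) + bias
    return feature_map

image = np.random.randn(6, 6)                      # a dummy 6x6 "image"
kernel = np.random.randn(3, 3)                     # a dummy 3x3 filter
print(convolve2d_valid(image, kernel).shape)       # (4, 4)
```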
4:49 let's assume that over here instead of
4:52 having only one filter you have two more
4:55 filters like this okay of the same size
4:58 in that case over here you will have two
5:01 more feature Maps like this and the size
5:03 of the feature map will become four by
5:05 four by three. But anyway, since for
5:09 simplicity purposes we are considering
5:11 only one filter, we are going to have
5:13 only one feature map so I am erasing all
5:16 this after achieving this feature map
5:18 the next thing that we will do is we
5:20 will apply ReLU, which is a simple
5:22 operation of converting all the negative
5:25 values into zero, nothing else, so
5:27 obviously this will not change the
5:29 size of the feature map and the size
5:32 remains four by four, and as you already
5:34 may guess, the next step will be
5:36 applying the pooling operation, and after
5:39 applying the pooling operation, as you
5:41 know, the size of the feature map will
5:43 be reduced, so it shrinks to two by two.
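As a quick sketch (again with dummy values, not the lecture's numbers), the ReLU and 2x2 max-pooling steps can be expressed in NumPy like this, assuming a 4x4 feature map coming out of the convolution:

```python
import numpy as np

feature_map = np.random.randn(4, 4)          # dummy 4x4 feature map from the conv layer

# ReLU: replace every negative value with zero, shape stays 4x4
relu_out = np.maximum(feature_map, 0)

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 window
pooled = relu_out.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(relu_out.shape, pooled.shape)          # (4, 4) (2, 2)
```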
5:46 so basically we are taking this image
5:48 then we are applying the filter to do
5:51 the convolutional operation in order to
5:53 achieve this feature map, a feature map
5:55 that will have the features from this
5:58 input image, then we apply the ReLU
6:00 operation on top of this feature map to
6:02 achieve this one and then we apply the
6:05 pooling operation to get a feature map
6:07 which is reduced in size so basically in
6:10 this architecture we have only one
6:12 convolutional layer and only one pooling
6:15 layer, and since this feature map is
6:18 reduced in size and it is in a size of
6:20 2x2 which means it is basically an image
6:24 of extremely low resolution that has
6:26 only 4 pixels, so in practice obviously
6:30 you will not work with images of this
6:32 size, but try to understand, I am
6:34 keeping this entire thing simple just to
6:36 give you a better explanation nothing
6:38 else and the next step will be to create
6:40 the flattened layer so basically what we
6:43 need to do is we will take this image or
6:46 the feature map, and first we will take
6:48 the first row of pixels and place it like
6:51 this, so this will have two values; then
6:53 we will take the second row of pixels and
6:55 we will place it at the bottom of this, and
6:58 this will also have two values and this
7:01 entire thing becomes the input of the
7:03 fully connected neural network towards
7:06 the end of this CNN architecture so I
7:09 will write it as ANN and again please do
7:12 not consider this as a single neuron
7:14 this circle at the end is representing
7:17 an entire neural network so basically
7:21 this ANN thing that I have drawn towards
7:24 the end is representing a neural
7:27 network that will have four neurons or
7:30 perceptrons in the input layer because
7:32 we have four values in the input over
7:35 here, then you will have a number of hidden
7:37 layers, you can choose that, and finally
7:39 we will have one output layer and this
7:42 particular ANN thing is representing
7:44 this whole architecture, so I tried to
7:47 explain this to you so that you don't
7:49 have questions in your mind that in the
7:51 input layer if we have four values then
7:54 how come we have only one perceptron
7:56 over here okay so this is actually not a
7:58 perceptron this is representing an
8:00 entire artificial neural network
8:02 architecture and now I am going to make
8:05 the screen a bit more clean by erasing
8:08 the unnecessary stuff right so this is a
8:12 very basic CNN architecture that we are
8:14 considering in order to understand the
8:17 back propagation operation so going by
8:19 your previous understanding of back
8:21 propagation in ANN or in any other
8:24 machine learning algorithm where you
8:26 will have trainable parameters we
8:28 already know that the basic idea of back
8:31 propagation is nothing but propagating
8:34 backwards in your architecture in order
8:36 to adjust the value of your trainable
8:39 parameters so obviously over here we are
8:42 able to understand that how the forward
8:45 propagation is happening we take the
8:47 image and we take this filter or kernel
8:50 Matrix to do the convolutional operation
8:52 that gives us this feature map then we
8:55 apply the ReLU function to discard all the
8:58 negative values and replace them with
9:00 zeros here then we apply this pooling
9:03 operation in order to reduce the size of
9:06 the feature map and then we apply the
9:08 flattening method in order to have the
9:11 values as an input layer for the NN
9:13 architecture, and once this ANN
9:15 architecture gives you an output, let's
9:18 consider that as y hat over there
9:20 obviously we will use a loss
9:22 function in order to check how
9:25 close we are compared to the
9:27 actual value, and as a loss
9:30 function you can either go with binary
9:32 cross entropy in case you have only two
9:34 classes in your target column let's say
9:37 you are doing a classification between
9:38 cat and dog, or you can also use softmax with
9:42 categorical cross-entropy in case you have more than two classes,
9:45 and this will be the entire idea of
9:48 forward propagation starting from the
9:51 image towards the end of the
9:52 architecture so as a next step back
9:55 propagation will happen in order to
9:57 adjust the values of all the trainable
10:00 parameters and before we understand
10:02 about that let's try to understand first
10:05 that how many trainable parameters we
10:07 actually have within this particular
10:09 architecture so the parameters for which
10:12 we need to adjust the values in order to
10:15 reduce the loss are first of all these
10:18 all the values within this filter so 3
10:20 by 3 which means nine values that you
10:23 need to adjust here and plus this filter
10:26 Matrix will also have a bias value a
10:29 scalar value okay so I am taking that as
10:32 well so 9 plus 1 additionally let's
10:35 check that where else we have trainable
10:37 parameters in this architecture so after
10:40 flattening this particular matrix or
10:43 feature map, when we acquire our input
10:46 values so for example we have four
10:48 values over here which is being passed
10:51 to the fully connected neural network or
10:53 the artificial neural network
10:54 architecture then obviously each and
10:57 every value will have a dedicated
10:59 weight assigned to it, so it will have
11:01 W1 W2 W3 and W4 so we have four weights
11:07 over here plus let's also consider a
11:10 bias term over here as well so let's
11:12 take it as B2 and here we will take it
11:16 as B1, so we have five trainable
11:19 parameters towards the end as well, so it
11:21 will be 10 plus 5, so in total we have 15
11:26 trainable parameters whose
11:28 values need to be adjusted in order to
11:31 reduce the loss that we have.
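If you want to sanity-check this count of 15, here is a minimal Keras sketch of the same toy architecture (the library choice is my own assumption; the lecture does not use any code), where model.summary() should report 15 trainable parameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Toy CNN mirroring the lecture: a 6x6x1 image, one 3x3 filter,
# 2x2 max pooling, flatten, and a single sigmoid output unit.
model = keras.Sequential([
    keras.Input(shape=(6, 6, 1)),
    layers.Conv2D(1, (3, 3), activation="relu"),   # 3*3*1 weights + 1 bias = 10 params
    layers.MaxPooling2D((2, 2)),                   # no trainable params
    layers.Flatten(),                              # 2x2x1 -> 4 values, no params
    layers.Dense(1, activation="sigmoid"),         # 4 weights + 1 bias = 5 params
])

model.summary()   # trainable params should come out to 15
```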
11:34 And now that we have understood how many
11:36 trainable parameters we have, let's try
11:38 to understand the backward propagation
11:41 operation over here so basically I want
11:43 to keep this lecture in such a way that
11:45 this should be the only video that you
11:48 will ever need in order to understand
11:50 the back propagation in CNN so
11:53 definitely this is going to be a bit
11:55 lengthy so please have some patience and
11:57 stick to the end so before we move ahead
11:59 what I want to do is I want to
12:01 generalize this entire architecture okay
12:04 because over here it is looking very
12:06 complex and this might seem a bit
12:08 difficult if I try to explain
12:10 back propagation to you on this particular
12:13 figure so let me place it over here on
12:16 the top and let's continue from here so
12:18 let's say that you have an image X and
12:21 you apply the convolutional operation on
12:23 top of this so let this be the symbol of
12:26 the convolutional operation which is
12:28 happening with the help of a kernel
12:31 Matrix or filter that has some weights
12:33 and that has a bias as well and after
12:37 doing this convolutional operation you
12:39 get a featured map and let's call that
12:41 X1 what is the next step then obviously
12:44 you apply the ReLU operation on top of
12:47 this, and let's say after applying this
12:49 ReLU operation you are getting R1, and
12:52 then the next step would be applying Max
12:54 pooling on top of this although you can
12:56 use any type of pooling but for the time
12:58 being let's consider we are using Max
13:00 pooling and then let's say you get P1
13:02 the feature map is now being called P1
13:05 what will be the next step then
13:07 obviously you will apply flattening and
13:10 this will give you the input layer let's
13:12 call that F this F will be passed on to
13:16 your artificial neural network
13:17 architecture or you can say the fully
13:20 connected neural network, and let's say
13:22 that will give you an equation s, which
13:24 will then be passed on to an activation
13:27 function like sigmoid that finally gives
13:30 you the output y hat.
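Putting this generalized pipeline together, here is a minimal sketch of the forward pass X -> conv -> ReLU -> max pool -> flatten -> fully connected -> sigmoid -> binary cross-entropy loss; the use of SciPy and every value and helper name here is an illustrative assumption on my part, not the lecture's code:

```python
import numpy as np
from scipy.signal import correlate2d   # assumption: SciPy available, used for brevity

def forward(X, W1, b1, W2, b2, y_true):
    X1 = correlate2d(X, W1, mode="valid") + b1        # convolution: 6x6 -> 4x4
    R1 = np.maximum(X1, 0)                            # ReLU
    P1 = R1.reshape(2, 2, 2, 2).max(axis=(1, 3))      # 2x2 max pooling: 4x4 -> 2x2
    F = P1.flatten()                                  # flatten: 2x2 -> 4 values
    s = np.dot(W2, F) + b2                            # fully connected part (one output unit)
    y_hat = 1.0 / (1.0 + np.exp(-s))                  # sigmoid
    loss = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))  # binary cross-entropy
    return y_hat, loss

X = np.random.randn(6, 6)      # dummy image
W1 = np.random.randn(3, 3)     # 9 filter weights
b1 = 0.1                       # filter bias
W2 = np.random.randn(4)        # 4 dense weights
b2 = 0.0                       # dense bias
print(forward(X, W1, b1, W2, b2, y_true=1.0))
```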
13:33 So let's do one thing first: let's shrink the size of
13:36 this simple architecture a bit so that
13:38 it can fit well on the screen. All right,
13:40 so now this output value y hat can be used
13:43 in order to calculate the loss to check
13:46 that how well we are doing the
13:47 prediction now the way we have the
13:49 trainable parameters over here we also
13:52 have the trainable parameters over here
13:54 as well so if you remember what we
13:56 discussed on the top that when we are
13:58 passing on the input values to the fully
14:01 connected neural network here as well we
14:03 have four parameters and one bias value
14:06 right so I'm talking about that one so
14:08 let's call it as W2 and bias 2. so
14:12 remember that this W1 has 9 values
14:16 because it is a filter of three by three
14:18 size plus one bias value so in total
14:21 there are 10 trainable parameters and
14:24 over here we have four weights plus one
14:27 bias so these are the parameters for
14:30 which we need to adjust the values
14:31 during back propagation in order to
14:34 reduce this loss value or you can say in
14:38 order to do more accurate prediction
14:40 since the better you predict the Lesser
14:43 loss you will have right so basically we
14:46 want to update the values of these
14:49 parameters right and if you remember by
14:51 the previous lectures of deep learning
14:53 or machine learning we use the formula
14:56 of this gradient descent algorithm in
14:59 order to update these values so what
15:00 does the formula say let's say we want
15:02 to update the value for this parameter
15:05 W1, okay, so the new value for this weight
15:10 will be equal to the old value of W1
15:14 minus a small learning rate multiplied
15:18 by dL/dW_old, that is, W1_new = W1_old - learning_rate * (dL/dW1_old). Now what does this mean?
15:23 It simply means: how much does the value of
15:26 loss change when we bring a very
15:30 small change to the value of W_old?
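As a tiny illustration of that update rule (the gradient value below is a made-up placeholder; in reality it comes out of the backward pass), the weight simply moves a small step against its gradient:

```python
# A minimal sketch of the gradient descent update for a single weight.
learning_rate = 0.01
w1_old = 0.5
dL_dw1 = -2.3                      # pretend dL/dW1_old came out of back propagation

w1_new = w1_old - learning_rate * dL_dw1
print(w1_new)                      # 0.523: the weight moves against the gradient
```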
15:33 obviously at this point of time I am
15:34 assuming that you are anyways
15:37 comfortable with the extremely basic
15:39 calculus or the idea of derivatives so I
15:42 am not going to focus so much on that
15:44 and in exactly the same way we also
15:46 update the value of the bias as well
15:48 just like this okay so basically we are
15:52 looking for these new values okay and
15:54 let's see what we already have we have
15:56 the old values that we initiated
15:59 randomly okay that we have actually used
16:02 in order to do the forward propagation in
16:04 order to achieve this output and to
16:07 calculate the loss as well we know the
16:09 value of this learning rate because this
16:11 is something that we have decided or
16:13 chosen for the algorithm, so that is also
16:15 within our control all we need to find
16:18 is the value for this the derivative of
16:20 loss with respect to the previous value
16:22 or the old value of the weights so the
16:25 idea of back propagation is nothing but
16:27 calculating this particular thing okay
16:29 so I will erase these things first and
16:33 let me bring this entire portion to the
16:35 side over here, okay. So the next
16:38 thing that we are going to do is this
16:40 entire CNN architecture okay I want you
16:43 to assume it in two parts first will be
16:46 this the convolutional part and second
16:49 will be this where we have the fully
16:51 connected neural network and within the
16:54 previous lectures of this particular
16:56 deep learning playlist we have discussed
16:58 a lot about artificial neural networks
17:01 and we have also discussed about how the
17:03 weight adjustment or the values of the
17:06 weights are being adjusted in the
17:08 artificial neural network during the
17:10 back propagation so anyways you will
17:12 have a decent Clarity around that if you
17:14 are following along with this playlist
17:15 the challenging part over here to
17:17 understand is that how the back
17:20 propagation happens in the flattening
17:22 layer in the pooling layer also for the
17:24 ReLU operation and most importantly
17:26 within the convolutional layer how
17:29 exactly it is happening over here so
17:31 obviously this part is going to be the
17:33 Crux of this lecture but still I will
17:35 go ahead and quickly cover
17:37 how the weight adjustment is happening
17:39 here so I'm talking about these weights
17:42 actually okay so basically what we want
17:44 we want to understand that how the loss
17:48 will be changed with respect to a change
17:51 with W2 okay or over here you can also
17:55 say B2, but let's consider W2 for the
17:57 time being, and try to understand that W2 or
18:01 the weights for the W2 parameter is not
18:04 directly involved in order to calculate
18:07 the loss because using the values of W2
18:10 or B2 or this entire weights these five
18:13 weights first we are calculating this
18:15 equation s which is then being passed on
18:18 to the sigmoid function in order to
18:20 calculate the output which is y using
18:23 the value of y we are then calculating
18:25 the loss so now we will have to
18:27 propagate backwards in order to
18:29 understand how we calculate dL by
18:32 dW2, so let's try to understand that. So
18:36 obviously this is a point where we are
18:38 going to talk about chain rule so what
18:40 does the idea of chain rule say over
18:42 here first we will check that how the
18:44 value of loss Will Change by bringing a
18:47 small change in the value of y hat so
18:49 that will be dL/dy_hat. The next step
18:55 will be how the value of y_hat is
18:58 being changed by bringing a change in s,
19:01 so that will be dy_hat/ds, and then
19:06 finally we will calculate how the
19:09 value of s will be changed by changing
19:13 the values of W2 or B2, let's say; anyways
19:16 both will be calculated in the same way,
19:18 but for the explanation purpose I am
19:20 considering W2 for the time being, so
19:23 this will be ds/dW2, and applying
19:28 the method of the chain rule, these terms
19:30 cancel out, and this is how we calculate
19:32 dL/dW2 = dL/dy_hat * dy_hat/ds * ds/dW2.
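To make this chain rule concrete, here is a small numeric sketch (my own illustration, not the lecture's) for a single sigmoid output with binary cross-entropy; the input values and label are arbitrary:

```python
import numpy as np

# Dummy forward pass for the fully connected part: s = W2 . F + b2, y_hat = sigmoid(s)
F = np.array([0.3, -0.1, 0.7, 0.2])    # flattened input (4 values)
W2 = np.array([0.5, -0.4, 0.1, 0.9])
b2 = 0.05
y = 1.0                                 # true label

s = np.dot(W2, F) + b2
y_hat = 1.0 / (1.0 + np.exp(-s))
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # binary cross-entropy

# Chain rule: dL/dW2 = dL/dy_hat * dy_hat/ds * ds/dW2
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_ds = y_hat * (1 - y_hat)
ds_dW2 = F
dL_dW2 = dL_dyhat * dyhat_ds * ds_dW2

print(dL_dW2)           # gradient for each of the four weights
# Well-known shortcut: for sigmoid + BCE this simplifies to (y_hat - y) * F
print((y_hat - y) * F)  # same numbers
```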
19:38 By finding this particular derivative, what
19:41 we basically try to check is what happens when we are
19:43 moving or changing the value of W2 in a
19:46 certain direction so let's say if we are
19:49 increasing using the value of W2 and
19:51 that is leading to the increasing value
19:53 of loss as well then obviously the
19:56 gradient descent formula will not update
19:58 the value of W2 by increasing it instead
20:02 it will go for the other approach by
20:04 trying to decrease the value of W2 and
20:07 it will check if the value of loss is
20:09 decreasing in that way this is how the
20:11 weight updation works anyways this is
20:13 something that we have already learned
20:15 in a lot of details when we were
20:17 discussing back propagation in ANN, but
20:19 still I thought of giving you this much
20:21 explanation for a revision purpose maybe
20:23 so that you should be feeling
20:25 comfortable within this lecture so
20:27 hopefully we are clear about the weight
20:30 updation part for this ANN part of the
20:33 CNN architecture and we are able to
20:35 understand that how these weights are
20:37 being updated or adjusted by doing the
20:40 back propagation by calculating
20:42 different derivatives and applying chain
20:44 rule in order to achieve the optimum
20:47 weights for these parameters and now we
20:50 are going to understand step by step the
20:52 idea of back propagation within the
20:54 convolutional part of this CNN
20:56 architecture since we are propagating
20:58 backwards from the end and we have
21:01 understood how we are calculating or
21:04 updating these values which means we
21:06 have already understood that how to
21:08 calculate this particular term right now
21:11 let's talk about the updation of these
21:13 weights so again for the explanation
21:15 purpose I am considering only W1 to show
21:18 you how the weight updation happens and
21:20 the same method or the same trick will
21:22 be applied for B1 as well okay so
21:24 basically we want to understand that how
21:27 the loss is being changed with respect
21:29 to W1 okay so obviously coming back from
21:33 the point of loss we are already here at
21:36 this point where we have calculated or
21:38 we have understood how DL by dw2 is
21:42 being calculated now let's try to flow
21:44 backwards from this point okay so we
21:47 will further apply the chain rule, and we
21:49 need to calculate the derivative of W2
21:52 with respect to, so from this point we
21:56 will now capture this one, so P1. The next
21:59 step will be to calculate the derivative
22:01 of P1 with respect to
22:04 R1, so this one, multiplied by, let me
22:08 erase this one, okay, we will calculate
22:10 the derivative of R1 with respect to
22:14 X1, and then finally
22:17 dX1 by dW1. Again, going by the chain
22:22 rule this will cancel out this this will
22:25 cancel out this and so on and we will
22:28 finally end up calculating that how loss
22:32 is changing with respect to bringing a
22:35 small change in W1, which is these
22:37 weights, W1 or B1 you can say, because
22:40 obviously you can understand the way we
22:42 are trying to update the value of W1
22:44 using this chain rule method the value
22:47 for B1 as well will be updated in
22:50 the similar way so let me make it
22:52 cleaner again since this part is all
22:55 clear for us we understand very well
22:57 how the weight updation happens for
23:00 the ANN architecture. Let's talk about
23:03 this part so how do we calculate the
23:05 derivative for the flattening layer so
23:07 let's understand about back propagation
23:10 in flattened layer so previously if you
23:13 will check what we were doing exactly in
23:15 the flattening layer, after applying the
23:17 pooling operation the size of the
23:19 feature map will be reduced from 4x4 to
23:23 2 by 2 and basically we were taking this
23:26 2 by 2 Matrix and we were flattening it
23:29 in order to have the input column or the
23:32 input values and during back propagation
23:34 we need to understand that how do we go
23:36 back from this step to this step right
23:40 so let's understand that so after
23:42 applying the pooling operation we had
23:45 this Matrix of size 2x2 that we were
23:48 calling P, or let me erase it from
23:51 here and write it here, P, and then what
23:54 we were basically doing is we were
23:56 applying the flattening method in order
23:59 to create this array that will have four
24:01 values to pass it on as an input to the
24:05 fully connected neural network and when
24:07 doing back propagation we do exactly the
24:10 opposite of this: we take this input
24:12 column, okay, these four values which were
24:16 previously in a size of four by one, that
24:18 means four rows and one column, and we
24:22 restore it back to the previous shape
24:24 of two by two like this. That's it, that's
24:28 all happens when we are doing back
24:30 propagation in flattening layer so let's
24:32 go to the previous explanation so
24:34 previously we knew how we calculate
24:37 this P1 by applying max pooling, and we
24:40 flatten it in order to have the input
24:42 values, but during back propagation we
24:45 restore the input values back to the
24:47 previous shape of two by two, or whatever
24:49 the shape was previously, so this is what
24:52 happens in the step of dW2 by dP1.
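In code, this forward flatten and backward "un-flatten" is nothing more than a reshape; here is a minimal NumPy sketch with made-up gradient values:

```python
import numpy as np

P1 = np.array([[0.8, 0.2],
               [0.5, 0.9]])          # dummy 2x2 pooled feature map

# Forward: flatten 2x2 -> 4 values for the fully connected layer
F = P1.flatten()                      # shape (4,)

# Backward: the gradient arriving at F is simply reshaped back to 2x2.
# The gradient values here are hypothetical stand-ins for dL/dF.
dL_dF = np.array([0.1, -0.3, 0.05, 0.2])
dL_dP1 = dL_dF.reshape(P1.shape)      # shape (2, 2), no parameters involved

print(F.shape, dL_dP1.shape)          # (4,) (2, 2)
```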
24:56 now let's move further in the discussion
24:58 of back propagation so starting from the
25:00 loss we came here to here we updated the
25:04 weights and we also understood that how
25:06 back propagation happens in the
25:08 flattening layer now it is time that we
25:10 understand that how back propagation is
25:13 happening in the pooling layer so let's
25:16 discuss over that now so just like the
25:18 flattened layer again we do not have any
25:20 trainable parameters here, so we are not
25:23 going to use any conventional back
25:24 propagation; instead we will again do
25:27 exactly the opposite of what we have done
25:30 previously during pooling or max pooling
25:33 operation so if you remember the max
25:35 pooling operation in this way let's say
25:37 if you have a structure like this okay
25:41 let me draw it completely then I will
25:43 explain so let's say that you have an
25:45 image like this that has a size of four
25:48 by four and in total it has 16 pixels
25:51 then when you are applying Max pooling
25:54 basically what you are doing is you take
25:56 a window of two by two okay this could
25:59 be of any size but let's say two by two
26:01 and you start moving this window on top
26:04 of the image and for each and every
26:06 position let's say initially the window
26:09 is over here then specifically when you
26:13 are applying max pooling you take the
26:13 maximum value out okay so from the first
26:16 position you will take four again you
26:19 move the pooling window to the right and
26:21 you take out eight; similarly you place
26:24 it over here this time and you take out
26:27 12 and finally in the last position you
26:30 will be taking out the highest value
26:31 which is 16 okay this is more or less
26:36 the entire idea of Max pooling but when
26:39 you are doing back propagation on the
26:42 pooling layer you do exactly the
26:44 opposite of this particular operation so
26:46 let me tell you what exactly happens
26:48 here you take this Matrix okay and then
26:52 you move it backwards to the previous
26:54 shape and it is really important that
26:57 while doing this back propagation we
26:59 should have a note that from which
27:01 particular index we have taken the
27:03 maximum value like these ones okay
27:06 because when we are restoring the
27:08 information backwards in the previous
27:10 shape only that particular position from
27:13 where the maximum element has been taken
27:15 out only that position will be non-zero
27:18 and rest of the part we will keep it as
27:21 0 but let me tell you one thing again
27:23 that none of these things will be done
27:25 by you manually okay we have already
27:28 discussed this particular thing at the
27:30 beginning of this lecture that the
27:32 entire operation of back propagation
27:34 will be handled by your code only it is
27:38 just that you should have some
27:39 understanding that what is actually
27:41 happening behind the code execution so
27:44 that a typical CNN architecture should
27:47 not look like a black box to you it is
27:49 really important from the interview
27:50 perspective and this is how the back
27:53 propagation happens in the pooling layer
27:55 obviously there are no trainable
27:57 parameters all we are doing is we are
28:00 doing the reverse of the previous
28:01 operation so initially it was a four by
28:04 four matrix, and by applying the max
28:06 pooling operation it got converted to a
28:09 two by two Matrix during the back
28:11 propagation we take this 2 by 2 Matrix
28:13 and we store it back to the previous
28:15 shape in this way, keeping the positions
28:18 from where the highest elements were
28:20 taken as non-zero. That's it.
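Here is a minimal NumPy sketch of that idea: the forward max pool remembers where each maximum came from, and the backward pass scatters the incoming gradient back to exactly those positions, leaving everything else zero. The helper names and gradient values are my own placeholders, not the lecture's:

```python
import numpy as np

def maxpool_forward(x, size=2):
    """2x2 max pooling with stride 2, remembering where each max came from."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    max_pos = {}                                  # (output index) -> (input index of the max)
    for i in range(0, h, size):
        for j in range(0, w, size):
            window = x[i:i + size, j:j + size]
            a, b = np.unravel_index(np.argmax(window), window.shape)
            out[i // size, j // size] = window[a, b]
            max_pos[(i // size, j // size)] = (i + a, j + b)
    return out, max_pos

def maxpool_backward(d_out, max_pos, input_shape):
    """Route each incoming gradient back to the position of the max; the rest stays 0."""
    dx = np.zeros(input_shape)
    for (oi, oj), (ii, ij) in max_pos.items():
        dx[ii, ij] = d_out[oi, oj]
    return dx

x = np.arange(1, 17, dtype=float).reshape(4, 4)   # a dummy 4x4 feature map
pooled, pos = maxpool_forward(x)
d_pooled = np.ones_like(pooled)                   # hypothetical upstream gradient
print(maxpool_backward(d_pooled, pos, x.shape))   # non-zero only at the max positions
```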
28:24 With that being said, we have now also
28:26 understood how this particular part
28:29 is being calculated so now we need to
28:32 understand the next step where we will
28:34 understand how the ReLU operation is
28:37 being considered during back propagation,
28:39 so this one is going to be the
28:41 simplest thing to understand, okay. See
28:43 what is happening within this ReLU
28:45 operation: all we are doing is, let's
28:47 assume over here we have a two by two
28:49 Matrix okay something that looks like
28:51 this let's say 2 minus 3 minus 1 and 6
28:56 then all we are doing is after applying
28:59 ReLU on this particular matrix, there
29:01 will be no change in the shape of this
29:03 Matrix it is just that all the positive
29:05 values will remain same and the negative
29:08 values will be converted to zero that's
29:10 the basic thing that the ReLU operation is
29:13 doing for you; this is the first thing.
29:14 Secondly, in order to calculate this
29:18 dL/dW1, which means in order to
29:21 understand how the loss is changing with
29:25 respect to bringing a small change in W1
29:28 or B1, let's say, we calculated or we are
29:31 trying to apply this particular chain
29:33 rule right and we have understood that
29:36 how the calculation is happening till
29:39 this point so far okay so when you are
29:43 propagating back from the point of loss
29:45 to the point of ReLU, which means from
29:48 this particular point to this particular
29:50 point after propagating back towards
29:53 over here the output that you will have
29:56 that also is going to be a two by two
29:58 Matrix and let's say this particular two
30:00 by two Matrix looks like this let's say
30:03 1 3
30:05 minus 6 and 2 okay so basically over
30:10 here what you are trying to do is you
30:13 are trying to calculate dr1 by dx1 which
30:17 means the derivative of these values all
30:21 the values within this particular Matrix
30:23 right so you will try to calculate D1
30:27 with respect to
30:29 dw1 because we are checking for W1 right
30:32 and similarly we will try to calculate
30:34 the derivative for other elements as
30:36 well so the idea is going to be very
30:39 simple if you are doing the derivative
30:42 for a positive value like this this or
30:46 this you will always get one obviously
30:49 you know that going by the basics of
30:51 calculus otherwise you will get zero
30:54 that's it that's the whole idea behind
30:56 it okay so if anyone asks you in an
30:59 interview that during back propagation
31:02 in a CNN architecture how does the back
31:04 propagation happens in the ReLU layer,
31:07 then you simply say that within this
31:10 particular step exactly here where we
31:12 are calculating dR1 by dX1, we are trying
31:16 to differentiate the values of R1 right
31:19 R1 is nothing but a 2 by 2 Matrix that
31:22 has four values and we are trying to
31:24 calculate the derivative of those four
31:26 values only and for those four values if
31:30 the values are positive, then obviously
31:32 the derivative will be 1, otherwise the
31:35 derivative will be zero. That's it.
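That masking behaviour is easy to see in a quick NumPy sketch (the matrices and gradient values are illustrative, not the lecture's):

```python
import numpy as np

X1 = np.array([[ 2.0, -3.0],
               [-1.0,  6.0]])        # dummy pre-ReLU feature map
R1 = np.maximum(X1, 0)               # forward ReLU: [[2, 0], [0, 6]]

# Backward: dR1/dX1 is 1 where the input was positive, 0 otherwise,
# so the upstream gradient is simply masked. The gradient here is hypothetical.
dL_dR1 = np.array([[0.4, -0.2],
                   [0.7,  0.1]])
dL_dX1 = dL_dR1 * (X1 > 0)           # gradient survives only where X1 was positive

print(dL_dX1)
```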
31:38 And finally, since we are done till this
31:39 point in the back propagation journey,
31:41 now the last step is to calculate this
31:44 dX1 by dW1.
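Before the intuition, here is a minimal NumPy sketch (my own, not the lecture's) of how this last piece is usually computed in code: the gradient of the loss with respect to each filter weight is the sum, over all feature-map positions, of the upstream gradient times the input pixel that weight touched there:

```python
import numpy as np

def conv_kernel_grad(image, dL_dX1, kernel_size):
    """Gradient of the loss w.r.t. the conv kernel for a stride-1 'valid' convolution.
    dL_dX1 is the gradient flowing back into the feature map X1 (hypothetical here)."""
    N = kernel_size
    out = dL_dX1.shape[0]
    dL_dW = np.zeros((N, N))
    for a in range(N):
        for b in range(N):
            # weight W[a, b] touched image[i + a, j + b] at every output position (i, j)
            dL_dW[a, b] = np.sum(dL_dX1 * image[a:a + out, b:b + out])
    return dL_dW

image = np.random.randn(6, 6)
dL_dX1 = np.random.randn(4, 4)        # pretend this came back through pooling and ReLU
print(conv_kernel_grad(image, dL_dX1, kernel_size=3))   # one gradient entry per filter weight
```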
31:50 So basically, for this particular filter, okay, where we are
31:52 using a three by three filter over here,
31:54 let's say for simplicity purposes that
31:57 we are using a filter of 2 by 2, okay,
32:00 that looks, let's say, something like this,
32:02 simply like 0, 0, 1, and 1. Okay, so now
32:07 during back propagation we will try to
32:09 update the values of this particular
32:11 filter into any random number okay
32:13 although that will not be random the
32:16 values will be updated in such a way
32:18 that the loss should be minimum but the
32:20 idea is to have the values in such a way
32:24 that these values within the filter when
32:27 applied on the image during the
32:30 convolutional operation this should
32:32 derive some meaningful features that's
32:34 it so I'll try to explain what I mean
32:36 here let's consider a very simple use
32:38 case let's say that you have an image
32:40 okay in the image you have a round
32:43 object like this okay and then you have
32:46 a two by two filter that you are using
32:48 size 2x2 uh let me write it outside okay
32:53 2 by 2 and this will have random values
32:56 like X1 let's say X2 uh X3 or let's say
33:00 X4 then what's the idea of using this
33:03 particular filter we will convert the
33:05 filter on top of the image first at this
33:07 position next over here straight by
33:10 stride then over here then over here
33:12 right so after convolving the filter all
33:16 over the image we should be able to grab
33:18 some meaningful features or edges right
33:21 like some primitive features like this
33:24 like this this this and then towards the
33:28 end of the convolutional layer finally
33:30 we will be able to identify the object
33:32 like this okay but let's say let's say
33:36 the values of your filter are kept
33:39 randomly in such a way that it will
33:41 detect horizontal edges like this or
33:43 let's say vertical edges like this or
33:45 let's say slant edges like this, then you
33:48 can think for yourself that combining this kind
33:51 of edges you will never be able to form
33:53 a circle that you have on the image so
33:56 when you finally flow towards the end
33:59 of the CNN architecture, over here the
34:02 loss will be very high when you calculate
34:04 the loss using binary cross entropy or
34:07 let's say soft Max the loss will be very
34:09 high okay because the edges that you are
34:12 trying to capture with the help of this
34:14 particular filter those edges are not
34:17 really that meaningful in order to
34:19 identify the object within the image so
34:21 the back propagation happens exactly the
34:24 way we have learned, and we adjust these
34:29 values in such a way that this time we
34:31 will try to capture another type of
34:33 edges, maybe edges that instead of being
34:37 horizontal or vertical are edges that
34:39 may look like this: curved edges
34:42 like this, primitive edges, okay, like this;
34:45 in that case, combining edges like this,
34:47 and obviously let's say if we are
34:50 talking about detecting four edges okay
34:53 first like this second this third this
34:56 and fourth being like this okay in that
34:58 case you will need to have four filters
35:01 so you will have four filters of size
35:04 two by two and you will apply them one by
35:06 one for the convolutional operation on
35:08 top of your image and by doing the back
35:10 propagation again and again the
35:13 algorithm finally succeeds to update
35:15 these values in such a way that they
35:18 will start capturing edges like this
35:20 okay this is the whole idea of doing the
35:24 back propagation in CNN finally you will
35:27 have these values let me make it a bit
35:29 cleaner this is looking very untidy so
35:32 in a CNN model when you are doing the
35:34 back propagation in order to reduce this
35:36 loss okay and we discussed all of this
35:39 how do we calculate this step then this
35:41 step then how do we back propagate on
35:43 top of flattening layer Max pooling
35:45 layer value layer and finally after
35:47 coming to this point the algorithm will
35:49 start updating these weights and biases
35:52 the way it was updating these weights
35:54 and biases okay finally in the end when
35:58 we have achieved the best
36:00 values for these weights and biases
36:03 which is nothing but the best values for
36:05 the filter that we are using okay if we
36:08 are able to do that in fact we will not
36:10 be doing that the algorithm will do that
36:12 for us but finally we reached to the
36:15 point where we get the best values for
36:17 the filter and then the filter will
36:19 start capturing the meaningful edges or
36:21 features from the image and then finally
36:24 if you pass an image of a Tweety Bird
36:27 then since these filters have the best
36:30 values in order to capture the edges or
36:33 features like the eyes of Tweety Bird or
36:36 its eyebrows or head or other features
36:38 it will pass the information towards the
36:41 end of the architecture so that it can
36:44 confidently say that it is an image of a
36:46 Tweety Bird, not of Goofy or Donald Duck, so
36:50 I hope this lecture was able to
36:52 provide you some understanding a very
36:55 decent understanding around back
36:57 propagation in CNN in case you are
36:59 learning this topic for the very first
37:02 time, if CNN is completely new to you, in
37:05 that case you may need to go through
37:08 this particular video maybe a couple
37:09 more times, but please do not hesitate
37:12 doing that because initially I also
37:14 struggled a lot to understand this
37:16 particular topic and congratulations if
37:18 you have already understood it try to
37:20 share it with your friends or peers who
37:22 may find it useful and thank you very
37:24 much for watching till the end we have
37:26 learned a lot within this deep learning
37:28 playlist although it is becoming very
37:29 difficult to make time from my full-time
37:31 job to create tutorials like this but I
37:34 will try my best to create more
37:36 tutorials on recurrent neural network in
37:40 order to continue this deep learning
37:42 playlist so please subscribe to the
37:43 channel if you're new here and hopefully
37:45 I will see you in the next lecture
37:52 thank you